Abstract

The problem of joint universal source coding and identification is considered in the setting of fixed-rate lossy 
coding of continuous-alphabet memoryless sources. For a wide class of bounded distortion measures, it is shown 
that any compactly parametrized family of $\mathbb{R}^d$-valued i.i.d. sources with absolutely continuous distributions satisfying
appropriate smoothness and Vapnik-Chervonenkis learnability conditions admits a joint scheme for universal lossy
block coding and parameter estimation, such that when the block length $n$ tends to infinity, the overhead per-letter
rate and the distortion redundancies converge to zero as $O(n^{-1}\log n)$ and $O(\sqrt{n^{-1}\log n})$, respectively. Moreover,
the active source can be determined at the decoder up to a ball of radius $O(\sqrt{n^{-1}\log n})$ in variational distance,
asymptotically almost surely. The system has finite memory length equal to the block length, and can be thought 
of as blockwise application of a time-invariant nonlinear filter with initial conditions determined from the previous 
block. Comparisons are presented with several existing schemes for universal vector quantization, which do not 
include parameter estimation explicitly, and an extension to unbounded distortion measures is outlined. Finally, 
finite mixture classes and exponential families are given as explicit examples of parametric sources admitting joint 
universal compression and modeling schemes of the kind studied here. 

Keywords: Learning, minimum-distance density estimation, two-stage codes, universal vector quantization, Vapnik- 
Chervonenkis dimension. 


I. Introduction 

In a series of influential papers [1]-[3], Rissanen has elucidated and analyzed deep connections between universal
lossless coding and statistical modeling. His approach hinges on the following two key insights: 

1) A given parametric class of information sources admits universal lossless codes if (a) the statistics of each 
source in the class (or, equivalently, the parameters of the source) can be determined with arbitrary precision 
from a sufficiently long data sequence and if (b) the parameter space can be partitioned into a finite number 
of subsets, such that the sources whose parameters lie in the same subset are “equivalent” in the sense of 
requiring “similar” optimal coding schemes. This idea extends naturally to hierarchical model classes (e.g., 
when the dimension of the parameter vector is unknown), provided that the parametric family of sources 
governed by each model satisfies the above regularity conditions individually. 

2) Given a sequence of symbols emitted by an information source from a hierarchical model class, an asymptotically
correct model of the source is obtained by finding the best trade-off between the number of bits needed
to describe it and the number of bits needed to losslessly encode the data assuming that the data are drawn 
from the maximum-likelihood distribution relative to this model. This is the basis of the so-called Minimum 
Description Length (MDL) principle for model selection and, more generally, statistical inference (see, e.g., 
the survey article of Barron, Rissanen and Yu [4] or the recent book by Grünwald [5]).

There is, in fact, a natural symmetry between these two insights, owing to the well-known one-to-one correspondence 
between (almost) optimal lossless codes and probability distributions on the space of all input sequences [6]. For this 
reason, when considering universal lossless coding, we can use the term “model” to refer either to the probability
distribution of the source or to an optimal lossless code for the source, where we allow codes with ideal (noninteger)
codeword lengths. The main point of Rissanen’s approach is precisely that the objectives of source coding and
modeling can be accomplished jointly and in an asymptotically optimal manner.

Consider the case of a parametric class of sources where the parameter space has finite dimension k. Then the 
redundancy of the corresponding universal lossless code (i.e., the excess average codelength relative to the optimal 
code for the actual source at a given block length) is controlled essentially by the number of bits required to describe 
the source parameters to the decoder. In particular, the achievability theorem of Rissanen [1, Theorem 1b] states
that one can use a scheme of this kind to achieve the redundancy of about $(k/2)\log n/n$ bits per symbol, where $n$
is the block length. The universal lossless coder used by Rissanen in [1] operates as follows: first, the input data
sequence is used to compute the maximum-likelihood estimate of the parameters of the source, then the estimate is
quantized to a suitable resolution, and finally the data are encoded with the corresponding optimum lossless code.
Structurally, this is an example of a two-stage code, in which the binary description of the input data sequence 
produced by the encoder consists of two parts: the first part describes the (quantized) maximum-likelihood estimate 
of the source parameters, while the second part describes the data using the code matched to the estimated source. 
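
As a hedged illustration of the two-stage structure just described (not Rissanen's actual construction), the following Python sketch treats a binary memoryless source, so that $k = 1$. The function names, the choice of about $\sqrt{n}$ quantization cells, and the midpoint quantizer are all assumptions made for this example. The first stage describes the quantized maximum-likelihood estimate; the second stage is the ideal (noninteger) codelength of the data under the quantized model.

```python
import math


def entropy_bits(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def two_stage_codelength(bits):
    """Return (first-stage bits, second-stage bits) for a binary sequence."""
    n = len(bits)
    ones = sum(bits)
    # Stage 1: maximum-likelihood estimate, quantized to about sqrt(n) cells
    # (an assumed resolution), which costs about (1/2) log2 n bits.
    levels = max(2, math.ceil(math.sqrt(n)))
    index = min(int(ones / n * levels), levels - 1)
    p_q = (index + 0.5) / levels  # cell midpoint, never exactly 0 or 1
    first_stage = math.log2(levels)
    # Stage 2: ideal (noninteger) codelength under the quantized model.
    second_stage = -(ones * math.log2(p_q) + (n - ones) * math.log2(1 - p_q))
    return first_stage, second_stage


bits = [1] * 1200 + [0] * 2896  # n = 4096 symbols
first, second = two_stage_codelength(bits)
n = len(bits)
overhead = (first + second) / n - entropy_bits(sum(bits) / n)
print(f"first stage: {first:.1f} bits, per-symbol overhead: {overhead:.5f}")
```

For this input the overhead relative to the empirical entropy consists mostly of the first-stage description, which is the mechanism behind the $(k/2)\log n/n$ behavior quoted above.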

In this paper, we investigate achievable redundancies in schemes for joint source coding and identification (modeling) 
in the setting of fixed-rate universal lossy block coding (vector quantization) of continuous-alphabet memoryless 
sources. Once we pass from lossless codes to lossy ones, the term “model” can refer either to a probabilistic 
description of the source or to a probability distribution over codebooks in the reproduction space. In particular, 
whereas choosing a lossless code for an information source is equivalent to choosing a probabilistic model of the 
source, choosing a lossy code corresponds in a certain sense to sampling from a discrete probability distribution 
over sequences in the reproduction alphabet, and is thus related to the source distribution only indirectly. To place 
the present work in a wider context, in Section VI we briefly comment on the line of research concerned with
relating lossy codes to codebook models, which can be thought of as a lossy variant of the MDL principle. However, 
there are situations in which one would like to compress the source and identify its statistics at the same time. For 
instance, in indirect adaptive control (see, e.g., Chapter 7 of Tao [7]) the parameters of the plant (the controlled
system) are estimated on the basis of observation, and the controller is modified accordingly. Consider the discrete-time
stochastic setting, in which the plant state sequence is a random process whose statistics are governed by a
finite set of parameters. Suppose that the controller is geographically separated from the plant and connected to it 
via a noiseless digital channel whose capacity is R bits per use. Then, given the time horizon T, the objective is to 
design an encoder and a decoder for the controller to obtain reliable estimates of both the plant parameters and the 
plant state sequence from the $2^{RT}$ possible outputs of the decoder. In this paper, we are concerned with modeling
the actual source directly, and not through a codebook distribution in the reproduction space. 

The objective of universal lossy coding (see, e.g., [8]-[13]) is to construct lossy block source codes (vector 
quantizers) that perform well in incompletely or inaccurately specified statistical environments. Roughly speaking, 
a sequence of vector quantizers is universal for a given class of information sources if it has asymptotically optimal 
performance, in the sense of minimizing the average distortion under the rate constraint, on any source in the 
class. Two-stage codes have also proved quite useful in universal lossy coding [10], [11], [13]. For instance, the 
two-stage universal quantizer introduced by Chou, Effros and Gray [13] is similar in spirit to the adaptive lossless 
coder of Rice and Plaunt [14], [15], known as the “Rice machine”: each input data sequence is encoded in parallel
with a number of codes, where each code is matched to one of the finitely many “representative” sources, and 
the code that performs the best on the given sequence (in the case of lossy codes, compresses it with the smallest 
amount of distortion) wins. Similar to the setting of Rissanen’s achievability theorem, the approach of [13] assumes a 
sufficiently smooth dependence of optimum coding schemes on the parameters of the source. However, the decision 
rule used in selection of the second-stage code does not rely on explicit modeling of the source statistics as the 
second-stage code is chosen on the basis of local (pointwise), rather than average, behavior of the data sequence 
with respect to a fixed collection of quantizers. This approach emphasizes the coding objective at the expense of
the modeling objective, thus falling short of exhibiting a relation between the latter and the former. 

In the present work, we consider parametric spaces $\{P_\theta\}$ of i.i.d. sources with values in $\mathbb{R}^d$, such that the $P_\theta$’s are
absolutely continuous and the parameter $\theta$ belongs to a bounded subset of $\mathbb{R}^k$. We show in a constructive manner that,
for a wide class of bounded distortion functions and under certain regularity conditions, such parametric families
admit universal sequences of quantizers with distortion redundancies converging to zero as $O(\sqrt{n^{-1}\log n})$ and
with an overhead per-letter rate converging to zero as $O(n^{-1}\log n)$, as the block length $n \to \infty$. These convergence
rates are, more or less, typical for universal coding schemes relying on explicit or implicit acquisition of the statistical 
model of the source (cf. the discussion in Section IV of this paper). For unbounded distortion functions satisfying a
certain moment condition with respect to a fixed reference letter, the distortion redundancies are shown to converge 
to zero as $O(\sqrt[4]{n^{-1}\log n})$. The novel feature of our method, however, is that the decoder can use the two-stage
binary description of the data not only to reconstruct the data with asymptotically optimal fidelity, but also to 
identify the active source up to a variational ball of radius $O(\sqrt{n^{-1}\log n})$ with probability approaching unity. In
fact, the universality and the rate of convergence of the compression scheme are directly tied to the performance 
of the source identification procedure. 

While our approach parallels Rissanen’s method for proving his achievability theorem in [1], there are two important 
differences with regard to both his work on lossless codes and subsequent work by others on universal lossy codes. 
The first difference is that the maximum-likelihood estimate, which fits naturally into the lossless framework, 
is no longer appropriate in the lossy case. In order to relate coding to modeling, we require that the probability 
distributions of the sources under consideration behave smoothly as functions of the parameter vectors; for compactly 
parametrized sources with absolutely continuous probability distributions, this smoothness condition is stated as a 
local Lipschitz property in terms of the $L_1$ distance between the probability densities of the sources and the Euclidean
distance in the parameter space. For bounded distortion measures, this implies that the expected performance of 
the corresponding optimum coding schemes also exhibits smooth dependence on the parameters. (By contrast, 
Chou, Effros and Gray [13] impose the smoothness condition directly on the optimum codes. This point will be 
elaborated upon in Section III.) Now, one can construct examples of sources with absolutely continuous probability
distributions for which the maximum-likelihood estimate behaves rather poorly in terms of the $L_1$ distance between
the true and the estimated probability densities [16]. Instead, we propose the use of the so-called minimum-distance
estimate, introduced by Devroye and Lugosi [17], [18] in the context of kernel density estimation. The introduction
of the minimum-distance estimate allows us to draw upon the powerful machinery of Vapnik-Chervonenkis theory
(see, e.g., [19] and Appendix A in this paper) both for estimating the convergence rates of density estimates and
distortion redundancies, as well as for characterizing the classes of sources that admit joint universal coding and 
identification schemes. The merging of Vapnik-Chervonenkis techniques with two-stage coding further underscores 
the forward relation between statistical learning/modeling and universal lossy coding. 

The second difference is that, unlike previously proposed schemes, our two-stage code has nonzero memory length. 
The use of memory is dictated by the need to force the code selection procedure to be blockwise causal and robust 
to local variations in the behavior of data sequences produced by “similar” sources. For a given block length n, the 
stream of input symbols is parsed into contiguous blocks of length n, and each block is quantized with a quantizer 
matched to the source with the parameters estimated from the preceding block. In other words, the coding process 
can be thought of as blockwise application of a nonlinear time-invariant filter with initial conditions determined 
by the preceding block. In the terminology of Neuhoff and Gilbert [20], this is an instance of a block-stationary 
causal source code. 

The remainder of the paper is organized as follows. In Section II, we state the basic notions of universal lossy coding
specialized to block codes with finite memory. Two-stage codes with memory are introduced in Section III and
placed in the context of statistical modeling and parameter estimation. The main result of this paper, Theorem 3.2, is
also stated and proved in Section III. Next, in Section IV, we present comparisons of our two-stage coding technique
with several existing techniques, as well as discuss some generalizations and extensions. In Section V, we show
that two well-known types of parametric sources, namely mixture classes and exponential families, satisfy,
under mild regularity requirements, the conditions of our main theorem and thus admit joint universal quantization
and identification schemes. Section VI offers a quick summary of the paper, together with a list of potential topics
for future research. Appendix A contains a telegraphic summary of notions and results from Vapnik-Chervonenkis
theory. Appendices B, C, and D are devoted to proofs of certain technical results used throughout the paper.

*The distortion redundancy of a lossy block code relative to a source is the excess distortion of the code compared to the optimum code 
for that source. 
II. Preliminaries

Let $\{X_i\}_{i=-\infty}^{\infty}$ be a memoryless stationary source with alphabet $\mathcal{X}$ (the source alphabet), i.e., the $X_i$’s are
independent and identically distributed (i.i.d.) random variables with values in $\mathcal{X}$. Suppose that the common
distribution of the $X_i$’s belongs to a given indexed class $\{P_\theta : \theta \in \Theta\}$ of probability measures on $\mathcal{X}$ (with
respect to an appropriate $\sigma$-field). The distributions on the $n$-blocks $X^n = (X_1, \ldots, X_n)$ will be denoted by $P_\theta^n$.
The superscript $n$ will be dropped whenever it is clear from the argument, such as in $P_\theta(x^n)$. Expectations with
respect to the corresponding process distributions will be denoted by $\mathbb{E}_\theta[\cdot]$, e.g., $\mathbb{E}_\theta[X^n]$. In this paper, we assume
that $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$, although this qualification is not required in the rest of the present section.

Consider coding $\{X_i\}$ into another process $\{\hat{X}_i\}$ with alphabet $\hat{\mathcal{X}}$ (the reproduction alphabet) by means of a
finite-memory stationary block code. Given any $m, n, t \in \mathbb{Z}$ with $m, n \ge 1$, let $X_m^n(t)$ denote the segment

$$(X_{tn-m+1}, X_{tn-m+2}, \ldots, X_{tn})$$

of $\{X_i\}$. When $n = m$, we shall abbreviate this notation to $X^n(t)$; when $m = 1$, we shall write $X_n(t)$; finally, when
$t = 1$, we shall write $X_m^n$, $X^n$, $X_n$. A code with block length $n$ and memory length $m$ [or an $(n, m)$-block code,
for short] is then described as follows. Each reproduction $n$-block $\hat{X}^n(t)$, $t \in \mathbb{Z}$, is a function of the corresponding
source $n$-block $X^n(t)$, as well as of $X_m^n(t-1)$, the $m$ source symbols immediately preceding $X^n(t)$, and this
function is independent of $t$:

$$\hat{X}^n(t) = C^{n,m}(X^n(t), X_m^n(t-1)), \qquad \forall t \in \mathbb{Z}.$$

When the code has zero memory, i.e., $m = 0$, we shall denote it more compactly by $C^n$. The performance of the
code is measured in terms of a single-letter distortion (or fidelity criterion), i.e., a measurable map $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$.
The loss incurred in reproducing a string $x^n \in \mathcal{X}^n$ by $\hat{x}^n \in \hat{\mathcal{X}}^n$ is given by

$$\rho(x^n, \hat{x}^n) = \sum_{i=1}^n \rho(x_i, \hat{x}_i).$$

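When the statistics of the source are described by $P_\theta$, the average per-letter distortion of $C^{n,m}$ is defined as

$$D_\theta(C^{n,m}) = \limsup_{k \to \infty} \frac{1}{k} \mathbb{E}_\theta[\rho(X^k, \hat{X}^k)] = \limsup_{k \to \infty} \frac{1}{k} \sum_{i=1}^k \mathbb{E}_\theta[\rho(X_i, \hat{X}_i)],$$

where the $\hat{X}_i$’s are determined from the rule $\hat{X}^n(t) = C^{n,m}(X^n(t), X_m^n(t-1))$ for all $t \in \mathbb{Z}$. Since the source is
i.i.d., hence stationary, for each $\theta \in \Theta$, both the reproduction process $\{\hat{X}_i\}$ and the pair process $\{(X_i, \hat{X}_i)\}$ are
$n$-stationary, i.e., the vector processes $\{\hat{X}^n(t)\}_{t=-\infty}^{\infty}$ and $\{(X^n(t), \hat{X}^n(t))\}_{t=-\infty}^{\infty}$ are stationary [20]. This implies
[21] that

$$D_\theta(C^{n,m}) = \frac{1}{n} \mathbb{E}_\theta[\rho(X^n, \hat{X}^n)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_\theta[\rho(X_i, \hat{X}_i)],$$

where $\hat{X}^n = C^{n,m}(X^n, X_m^n(0))$.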

More specifically, we shall consider fixed-rate lossy block codes (also referred to as vector quantizers). A fixed-rate
lossy $(n, m)$-block code is a pair $(f, \phi)$ consisting of an encoder $f : \mathcal{X}^n \times \mathcal{X}^m \to S$ and a decoder $\phi : S \to \hat{\mathcal{X}}^n$,
where $S \subset \{0,1\}^*$ is a collection of fixed-length binary strings. The quantizer function $C^{n,m} : \mathcal{X}^n \times \mathcal{X}^m \to \hat{\mathcal{X}}^n$
is the composite map $\phi \circ f$; we shall often abuse notation, denoting by $C^{n,m}$ also the pair $(f, \phi)$. The number
$R(C^{n,m}) = n^{-1} \log |S|$ is called the rate of $C^{n,m}$, in bits per letter (unless specified otherwise, all logarithms in
this paper will be taken to base 2). The set $\Gamma = \{\phi(s) : s \in S\}$ is the reproduction codebook of $C^{n,m}$.

The optimum performance achievable on the source $P_\theta$ by any finite-memory code with block length $n$ is given
by the $n$th-order operational distortion-rate function (DRF)

$$\widehat{D}_\theta^{n,*}(R) = \inf_m \inf_{C^{n,m}} \{D_\theta(C^{n,m}) : R(C^{n,m}) \le R\},$$

where the infimum is over all finite-memory block codes with block length $n$ and with rate at most $R$ bits per
letter. If we restrict the codes to have zero memory, then the corresponding $n$th-order performance is given by

$$\widehat{D}_\theta^n(R) = \inf_{C^n} \{D_\theta(C^n) : R(C^n) \le R\}.$$

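Clearly, $\widehat{D}_\theta^{n,*}(R) \le \widehat{D}_\theta^n(R)$. However, as far as optimal performance goes, allowing nonzero memory length does
not help, as the following elementary lemma shows:

Lemma 2.1. $\widehat{D}_\theta^{n,*}(R) = \widehat{D}_\theta^n(R)$.

Proof: It suffices to show that $\widehat{D}_\theta^n(R) \le \widehat{D}_\theta^{n,*}(R)$. Consider an arbitrary $(n, m)$-block code $C^{n,m} = \phi \circ f$,
$f : \mathcal{X}^n \times \mathcal{X}^m \to S$, $\phi : S \to \hat{\mathcal{X}}^n$. We claim that there exists a zero-memory code $C_*^n = \phi_* \circ f_*$, $f_* : \mathcal{X}^n \to S$,
$\phi_* : S \to \hat{\mathcal{X}}^n$, such that

$$\rho(x^n, C_*^n(x^n)) \le \rho(x^n, C^{n,m}(x^n, z^m))$$

for all $x^n \in \mathcal{X}^n$, $z^m \in \mathcal{X}^m$. Indeed, define $f_*$ as the minimum-distortion encoder

$$x^n \mapsto \arg\min_{s \in S} \rho(x^n, \phi(s))$$

for the reproduction codebook of $C^{n,m}$, and let $\phi_*(s) = \phi(s)$. Then it is easy to see that $R(C_*^n) \le R(C^{n,m})$ and
$D_\theta(C_*^n) \le D_\theta(C^{n,m})$ for all $\theta \in \Theta$, and the lemma is proved. ■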
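
The argument in the proof can be checked numerically. The sketch below is only an illustration under stated assumptions: squared-error distortion, a small random codebook, and a hypothetical encoder with memory (`memory_encode`) whose choice of index is perturbed by the past block. It verifies that the minimum-distortion (nearest-neighbor) encoder over the same codebook never incurs a larger distortion.

```python
import random


def distortion(x, y):
    """Additive squared-error distortion between two n-blocks."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def min_distortion_encode(x, codebook):
    """Zero-memory encoder f_*: index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda s: distortion(x, codebook[s]))


def memory_encode(x, z, codebook):
    """Hypothetical encoder with memory: the index also depends on the past block z."""
    shift = 1 if sum(z) > 0 else 0
    return (min_distortion_encode(x, codebook) + shift) % len(codebook)


random.seed(1)
n, m, size = 4, 4, 8
codebook = [[random.gauss(0, 1) for _ in range(n)] for _ in range(size)]
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(n)]
    z = [random.gauss(0, 1) for _ in range(m)]
    d_star = distortion(x, codebook[min_distortion_encode(x, codebook)])
    d_mem = distortion(x, codebook[memory_encode(x, z, codebook)])
    assert d_star <= d_mem  # pointwise inequality claimed in the proof
```

Both encoders index the same codebook, so the rate is unchanged, and the pointwise inequality carries over to the average distortion under any source.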

Armed with this lemma, we can compare the performance of all fixed-rate quantizers with block length $n$, with
or without memory, to the $n$th-order operational DRF $\widehat{D}_\theta^n(R)$. If we allow the block length to grow, then the best
performance that can be achieved by a fixed-rate quantizer with or without memory on the source $P_\theta$ is given by
the operational distortion-rate function

$$\widehat{D}_\theta(R) = \inf_n \widehat{D}_\theta^n(R) = \lim_{n \to \infty} \widehat{D}_\theta^n(R).$$

Since an i.i.d. source is stationary and ergodic, the source coding theorem and its converse [22, Ch. 9] guarantee
that the operational DRF $\widehat{D}_\theta(R)$ is equal to the Shannon DRF $D_\theta(R)$, which in the i.i.d. case admits the following
single-letter characterization:

$$D_\theta(R) = \inf_Q \{\mathbb{E}_{P_\theta Q}[\rho(X, \hat{X})] : I_{P_\theta Q}(X; \hat{X}) \le R\}.$$

Here, the infimum is taken over all conditional probabilities (or test channels) $Q$ from $\mathcal{X}$ to $\hat{\mathcal{X}}$, so that $P_\theta Q$ is the
corresponding joint probability on $\mathcal{X} \times \hat{\mathcal{X}}$, and $I$ is the mutual information.

A universal lossy coding scheme at rate $R$ for the class $\{P_\theta : \theta \in \Theta\}$ is a sequence of codes $\{C^{n,m}\}$, where
$n = 1, 2, \ldots$ and $m$ is either a constant or a function of $n$, such that for each $\theta \in \Theta$, $R(C^{n,m})$ and $D_\theta(C^{n,m})$
converge to $R$ and $D_\theta(R)$, respectively, as $n \to \infty$. Depending on the mode of convergence with respect to $\theta$,
one gets different types of universal codes. Specifically, let $\{C^{n,m}\}_{n=1}^{\infty}$ be a sequence of lossy codes satisfying
$R(C^{n,m}) \to R$ as $n \to \infty$. Then, following [9], we can distinguish between the following three types of universality:

Definition 2.1 (weighted universal). $\{C^{n,m}\}_{n=1}^{\infty}$ is weighted universal for $\{P_\theta : \theta \in \Theta\}$ with respect to a probability
distribution $W$ on $\Theta$ (on an appropriate $\sigma$-field) if the distortion redundancy

$$\delta_\theta(C^{n,m}) = D_\theta(C^{n,m}) - D_\theta(R)$$

converges to zero in the mean, i.e.,

$$\lim_{n \to \infty} \int_\Theta \delta_\theta(C^{n,m}) \, dW(\theta) = 0.$$

Definition 2.2 (weakly minimax universal). $\{C^{n,m}\}_{n=1}^{\infty}$ is weakly minimax universal for $\{P_\theta : \theta \in \Theta\}$ if

$$\lim_{n \to \infty} \delta_\theta(C^{n,m}) = 0$$

for each $\theta \in \Theta$, i.e., $\delta_\theta(C^{n,m})$ converges to zero pointwise in $\theta$.

Definition 2.3 (strongly minimax universal). $\{C^{n,m}\}_{n=1}^{\infty}$ is strongly minimax universal for $\{P_\theta : \theta \in \Theta\}$ if the
convergence of $\delta_\theta(C^{n,m})$ to zero as $n \to \infty$ is uniform in $\theta$.

The various relationships between the three types of universality have been explored in detail, e.g., in [9]. From the
practical viewpoint, the differences between them are rather insubstantial. For instance, the existence of a weighted
universal sequence of codes for $\{P_\theta : \theta \in \Theta\}$ with respect to $W$ implies, for any $\epsilon > 0$, the existence of a strongly
minimax universal sequence for $\{P_\theta : \theta \in \Theta_\epsilon\}$ for some $\Theta_\epsilon \subseteq \Theta$ satisfying $W(\Theta_\epsilon) \ge 1 - \epsilon$. In this paper, we shall
concentrate exclusively on weakly minimax universal codes.

Once the existence of a universal sequence of codes is established in an appropriate sense, we can proceed to
determine the rate of convergence. To facilitate this, we shall follow Chou, Effros and Gray [13] and split the
redundancy $\delta_\theta(C^{n,m})$ into two nonnegative terms:

$$\delta_\theta(C^{n,m}) = \big(D_\theta(C^{n,m}) - \widehat{D}_\theta^n(R)\big) + \big(\widehat{D}_\theta^n(R) - D_\theta(R)\big). \tag{2.1}$$

The first term, which we shall call the $n$th-order redundancy and denote by $\delta_\theta^n(C^{n,m})$, quantifies the difference
between the performance of $C^{n,m}$ and the $n$th-order operational DRF, while the second term tells us by how
much the $n$th-order operational DRF exceeds the Shannon DRF, with respect to the source $P_\theta$. Note that $\delta_\theta(C^{n,m})$
converges to zero if and only if $\delta_\theta^n(C^{n,m})$ does, because $\widehat{D}_\theta^n(R) \to D_\theta(R)$ as $n \to \infty$ by the source coding
theorem. Thus, in proving the existence of universal codes, we shall determine the rates at which the two terms on
the right-hand side of (2.1) converge to zero as $n \to \infty$.

III. Two-stage joint universal coding and modeling 

As discussed in the Introduction, two-stage codes are both practically and conceptually appealing for analysis and 
design of universal codes. A two-stage lossy block code (vector quantizer) with block length $n$ is a code that
describes each source sequence $x^n$ in two stages: in the first stage, a quantizer of block length $n$ is chosen as a
function of $x^n$ from some collection of available quantizers; this is followed by the second stage, in which $x^n$ is
encoded with the chosen code.

In precise terms, a two-stage fixed-rate lossy code is defined as follows [13]. Let $\tilde{f} : \mathcal{X}^n \to \tilde{S}$ be a mapping of $\mathcal{X}^n$
into a collection $\tilde{S}$ of fixed-length binary strings, and assume that to each $\tilde{s} \in \tilde{S}$ there corresponds an $n$-block
code $C^n_{\tilde{s}} = (f_{\tilde{s}}, \phi_{\tilde{s}})$ at rate of $R$ bits per letter. A two-stage code $C^n$ is defined by the encoder

$$f(x^n) = \tilde{f}(x^n) \, f_{\tilde{f}(x^n)}(x^n)$$

and the decoder

$$\phi(\tilde{s} s) = \phi_{\tilde{s}}(s).$$

Here the juxtaposition of two binary strings stands for their concatenation. The map $\tilde{f}$ is called the first-stage
encoder. The rate of this code is $R + n^{-1} \log |\tilde{S}|$ bits per letter, while the instantaneous distortion is

$$\rho(x^n, C^n(x^n)) = \rho(x^n, C^n_{\tilde{f}(x^n)}(x^n)).$$

Now consider using $C^n$ to code an i.i.d. process $\{X_i\}$ with all the $X_i$’s distributed according to $P_\theta$ for some $\theta \in \Theta$.
This will result in the average per-letter distortion

$$D_\theta(C^n) = \frac{1}{n} \mathbb{E}_\theta[\rho(X^n, C^n(X^n))] = \frac{1}{n} \mathbb{E}_\theta[\rho(X^n, C^n_{\tilde{f}(X^n)}(X^n))] = \frac{1}{n} \int \rho(x^n, C^n_{\tilde{f}(x^n)}(x^n)) \, dP_\theta(x^n).$$

Note that it is not possible to express $D_\theta(C^n)$ in terms of expected distortion of any single code because the
identity of the code used to encode each $x^n \in \mathcal{X}^n$ itself varies with $x^n$.

Let us consider the following modification of two-stage coding. As before, we wish to code an i.i.d. source $\{X_i\}$
with an $n$-block lossy code, but this time we allow the code to have finite memory $m$. Assume once again that
we have an indexed collection $\{C^n_{\tilde{s}} : \tilde{s} \in \tilde{S}\}$ of $n$-block codes, but this time the first-stage encoder is a map
$\tilde{f} : \mathcal{X}^m \to \tilde{S}$ from the space of $m$-blocks over $\mathcal{X}$ into $\tilde{S}$. In order to encode the current $n$-block $X^n(t)$, $t \in \mathbb{Z}$,
the encoder first looks at $X_m^n(t-1)$, the $m$-block immediately preceding $X^n(t)$, selects a code $C^n_{\tilde{s}}$ according to
the rule $\tilde{s} = \tilde{f}(X_m^n(t-1))$, and then codes $X^n(t)$ with that code. In this way, we have a two-stage $(n, m)$-block
code $C^{n,m}$ with the encoder

$$f(x^n, z^m) = \tilde{f}(z^m) \, f_{\tilde{f}(z^m)}(x^n)$$

and the decoder

$$\phi(\tilde{s} s) = \phi_{\tilde{s}}(s).$$

The operation of this code can be pictured as a blockwise application of a nonlinear time-invariant filter with the
initial conditions determined by a fixed finite amount of past data. Just as in the memoryless case, the rate of
$C^{n,m}$ is $R + n^{-1} \log |\tilde{S}|$ bits per letter, but the instantaneous distortion is now given by

$$\rho(x^n, C^{n,m}(x^n, z^m)) = \rho(x^n, C^n_{\tilde{f}(z^m)}(x^n)).$$

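When the common distribution of the $X_i$’s is $P_\theta$, the average per-letter distortion is given by

$$D_\theta(C^{n,m}) = \frac{1}{n} \int \rho(x^n, C^n_{\tilde{f}(z^m)}(x^n)) \, dP_\theta(x^n, z^m) = \mathbb{E}_\theta\Big\{ \frac{1}{n} \mathbb{E}_\theta\big[\rho(X^n, C^n_{\tilde{f}(Z^m)}(X^n)) \,\big|\, Z^m\big] \Big\} = \mathbb{E}_\theta\big[D_\theta(C^n_{\tilde{f}(Z^m)})\big]. \tag{3.2}$$

Observe that the use of memory allows us to decouple the choice of the code from the actual encoding operation,
which in turn leads to an expression for the average distortion of $C^{n,m}$ that involves iterated expectations.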
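
The decoupling of code selection from encoding can be sketched as follows. This is a toy illustration, not the construction used later in the paper: the sources are assumed to be scalar and to differ only in location, the first-stage rule (nearest center to the past block's sample mean) and the uniform per-letter codebooks are hypothetical, and all names are invented for the example.

```python
def first_stage(prev_block, centers):
    """Hypothetical first-stage encoder: index of the center nearest the past block's mean."""
    mean = sum(prev_block) / len(prev_block)
    return min(range(len(centers)), key=lambda s: abs(mean - centers[s]))


def make_codebook(center, levels=4, step=0.5):
    """Hypothetical per-letter codebook matched to a source located at `center`."""
    return [center + step * (j - (levels - 1) / 2) for j in range(levels)]


def encode_stream(blocks, centers):
    """Code each block with the quantizer selected from the preceding block."""
    coded = []
    for t in range(1, len(blocks)):
        s = first_stage(blocks[t - 1], centers)  # depends only on the past block
        book = make_codebook(centers[s])
        idx = [min(range(len(book)), key=lambda j: (x - book[j]) ** 2) for x in blocks[t]]
        coded.append((s, idx))  # first-stage index followed by second-stage indices
    return coded


def decode_stream(coded, centers):
    """Rebuild each block from its two-part description."""
    return [[make_codebook(centers[s])[j] for j in idx] for s, idx in coded]


blocks = [[0.1, -0.2, 0.0], [0.3, 0.1, -0.1], [2.1, 1.9, 2.2], [2.0, 2.3, 1.8]]
centers = [0.0, 2.0]
coded = encode_stream(blocks, centers)
print([s for s, _ in coded])
print(decode_stream(coded, centers)[-1])
```

Because the index is computed from the previous block only, the code applied to the current block is fixed before that block is seen, which is what makes the iterated-expectation form of (3.2) possible. The price is a one-block lag when the source changes, visible in the third block of the example.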


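Intuitively, this scheme will yield a universal code if

$$\mathbb{E}_\theta\big[D_\theta(C^n_{\tilde{f}(X^m)})\big] \approx \min_{\tilde{s} \in \tilde{S}} D_\theta(C^n_{\tilde{s}}) \tag{3.3}$$
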


for each $\theta \in \Theta$. Keeping in mind that $\tilde{f}(X^m)$ is allowed to take only a finite number $|\tilde{S}|$ of values, we see that
condition (3.3) must be achieved through some combination of parameter estimation and quantization. To this end,
we impose additional structure on the map $\tilde{f}$. Namely, we assume that it is composed of a parameter estimator
$\tilde{\theta} : \mathcal{X}^m \to \Theta$ that uses the past data $X_m^n(t-1)$ to estimate the parameter label $\theta \in \Theta$ of the source in effect, and
a lossy parameter encoder $\tilde{g} : \Theta \to \tilde{S}$, whereby the estimate $\tilde{\theta}$ is quantized to $\tilde{R} = \log |\tilde{S}|$ bits, with respect to a
suitable distortion measure on $\Theta$. A binary description of the quantized version $\hat{\theta}$ of $\tilde{\theta}(X_m^n(t-1))$ is then passed
on to the second-stage encoder, which will quantize the current $n$-block $X^n(t)$ with an $n$-block code matched to
$P_{\hat{\theta}}$. Provided that $P_\theta$ and $P_{\hat{\theta}}$ are “close” to each other in an appropriate sense, the resulting performance will be
almost as good as if the actual parameter $\theta$ were known all along. As a bonus, the decoder will also receive a good
$\tilde{R}$-bit binary representation (model) of the source in effect. Therefore, we shall also define a parameter decoder
$\tilde{\psi} : \tilde{S} \to \Theta$, so that $\hat{\theta} = \tilde{\psi}(\tilde{f}(X_m^n(t-1)))$ can be taken as an estimate of the parameter $\theta \in \Theta$ of the active
source. The structure of the encoder and the decoder in this two-stage scheme for joint modeling and lossy coding
is displayed in Fig. 1.

These ideas are formalized in Theorem 3.2 below for i.i.d. vector sources $\{X_i\}$, $X_i \in \mathbb{R}^d$, where the common
distribution of the $X_i$’s is a member of a given indexed class $\{P_\theta : \theta \in \Theta\}$ of absolutely continuous distributions,
and the parameter space $\Theta$ is a bounded subset of $\mathbb{R}^k$. For simplicity we have set $m = n$, although other choices
for the memory length are also possible. Before we state and prove the theorem, let us fix some useful results and
notation. The following proposition generalizes Theorem 2 of Linder, Lugosi and Zeger [11] to i.i.d. vector sources
and characterizes the rate at which the $n$th-order operational DRF converges to the Shannon DRF (the proof, which
uses Csiszár’s generalized parametric representation of the DRF [23], as well as a combination of standard random
coding arguments and large-deviation estimates, is an almost verbatim adaptation of the proof of Linder et al. to
vector sources, and is presented for completeness in Appendix B):
[Figure: block diagram with blocks labeled “First-stage encoder” and “Second-stage encoder”.]

Fig. 1. Two-stage scheme for joint modeling and lossy coding (top: encoder; bottom: decoder).


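Proposition 3.1. Let $\{X_i\}$ be an i.i.d. source with alphabet $\mathcal{X} \subseteq \mathbb{R}^d$, where the common distribution of the $X_i$’s
comes from an indexed class $\{P_\theta : \theta \in \Theta\}$. Let $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ be a distortion function satisfying the following
two conditions:

1) $\inf_{\hat{x} \in \hat{\mathcal{X}}} \rho(x, \hat{x}) = 0$ for all $x \in \mathcal{X}$.

2) $\sup_{x \in \mathcal{X},\, \hat{x} \in \hat{\mathcal{X}}} \rho(x, \hat{x}) = \rho_{\max} < \infty$.

Then for every $\theta \in \Theta$ and every $R > 0$ such that $D_\theta(R) > 0$ there exists a constant $c_\theta(R)$ such that

$$\widehat{D}_\theta^n(R) - D_\theta(R) \le \big(c_\theta(R) + o(1)\big) \sqrt{\frac{\log n}{n}}.$$

The function $c_\theta(R)$ is continuous in $R$, and the $o(1)$ term converges to zero uniformly in a sufficiently small
neighborhood of $R$.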

Remark 3.1. The condition $D_\theta(R) > 0$ is essential to the proof and holds for all $R > 0$ whenever $P_\theta$ has a
continuous component, which is assumed in the following.

Remark 3.2. The constant $c_\theta(R)$ depends on the derivative of the DRF $D_\theta(R)$ at $R$ and on the maximum value
$\rho_{\max}$ of the distortion function.



The distance between two i.i.d. sources will be measured by the variational distance between their respective
single-letter distributions [19, Ch. 5]:

$$d_V(P_\theta, P_\eta) = \sup_B |P_\theta(B) - P_\eta(B)|, \qquad \forall \theta, \eta \in \Theta,$$

where the supremum is taken over all Borel subsets of $\mathbb{R}^d$. Also, given a sequence of real-valued random variables
$V_1, V_2, \ldots$ and a sequence of nonnegative numbers $a_1, a_2, \ldots$, the notation $V_n = O(a_n)$ a.s. means that there exist
a constant $c > 0$ and a nonnegative random variable $N \in \mathbb{Z}$ such that $V_n \le c a_n$ for all $n \ge N$. Finally, both the
statement and the proof of the theorem rely on certain notions from Vapnik-Chervonenkis theory; for the reader’s
convenience, Appendix A contains a summary of the necessary definitions and results.


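Theorem 3.2. Let $\{X_i\}_{i=-\infty}^{\infty}$ be an i.i.d. source with alphabet $\mathcal{X} \subseteq \mathbb{R}^d$, where the common distribution of the $X_i$’s
is a member of a class $\{P_\theta : \theta \in \Theta\}$ of absolutely continuous distributions with the corresponding densities $p_\theta$.
Assume the following conditions are satisfied:

1) $\Theta$ is a bounded subset of $\mathbb{R}^k$.

2) The map $\theta \mapsto P_\theta$ is uniformly locally Lipschitz: there exist constants $r > 0$ and $m > 0$ such that, for each
$\theta \in \Theta$,

$$d_V(P_\theta, P_\eta) \le m \|\theta - \eta\|$$

for all $\eta \in B_r(\theta)$, where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^k$ and $B_r(\theta)$ is an open ball of radius $r$ centered
at $\theta$.

3) The Yatracos class [17], [18], [24] associated with $\Theta$, defined as

$$\mathcal{A}_\Theta = \big\{ A_{\theta,\eta} = \{x \in \mathcal{X} : p_\theta(x) > p_\eta(x)\} : \theta, \eta \in \Theta;\ \theta \ne \eta \big\},$$

is a Vapnik-Chervonenkis class, $V(\mathcal{A}_\Theta) = V < \infty$.

Let $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ be a single-letter distortion function of the form $\rho(x, \hat{x}) = [d(x, \hat{x})]^p$ for some $p > 0$,
where $d(\cdot, \cdot)$ is a bounded metric on $\mathcal{X} \cup \hat{\mathcal{X}}$. Suppose that for each $n$ and each $\theta \in \Theta$ there exists an $n$-block code
$C_\theta^n = (f_\theta, \phi_\theta)$ at rate of $R > 0$ bits per letter that achieves the $n$th-order operational DRF for $P_\theta$: $D_\theta(C_\theta^n) = \widehat{D}_\theta^n(R)$.
Then there exists an $(n, n)$-block code $C^{n,n}$ with

$$R(C^{n,n}) = R + O\Big(\frac{\log n}{n}\Big), \tag{3.4}$$

such that for every $\theta \in \Theta$

$$\delta_\theta(C^{n,n}) = O\Big(\sqrt{\frac{\log n}{n}}\Big). \tag{3.5}$$

The resulting sequence of codes $\{C^{n,n}\}_{n=1}^{\infty}$ is therefore weakly minimax universal for $\{P_\theta : \theta \in \Theta\}$ at rate $R$.

Furthermore, for each $n$ the first-stage encoder $\tilde{f}$ and the corresponding parameter decoder $\tilde{\psi}$ are such that

$$d_V\big(P_\theta, P_{\tilde{\psi}(\tilde{f}(X^n))}\big) = O\Big(\sqrt{\frac{\log n}{n}}\Big) \quad \text{a.s.}, \tag{3.6}$$

where the probability is with respect to $P_\theta$. The constants implicit in the $O(\cdot)$ notation in (3.4) and (3.6) are
independent of $\theta$.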

Proof: The theorem will be proved by construction of a two-stage code, where the first-stage encoder $\tilde{f} : \mathcal{X}^n \to \tilde{S}$
is a cascade of the parameter estimator $\tilde{\theta} : \mathcal{X}^n \to \Theta$ and the lossy parameter encoder $\tilde{g} : \Theta \to \tilde{S}$.
Estimation of the parameter vector $\theta$ at the decoder will be facilitated by the corresponding decoder $\tilde{\psi} : \tilde{S} \to \Theta$.


Our parameter estimator will be based on the so-called minimum-distance density estimator [16, Sec. 5.5], originally developed by Devroye and Lugosi [17], [18] in the context of kernel density estimation. It is constructed as follows. Let $Z^n = (Z_1, \dots, Z_n)$ be i.i.d. according to $P_\theta$ for some $\theta \in \Theta$. Given any $\eta \in \Theta$, let

$$\Delta_\eta(Z^n) = \sup_{A \in \mathcal{A}_\Theta} \left| P_\eta(A) - P_{Z^n}(A) \right|,$$

where $P_{Z^n}$ is the empirical distribution of $Z^n$,

$$P_{Z^n}(B) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{Z_i \in B\}}$$

for any Borel set $B$. Define $\tilde{\theta}(Z^n)$ as any $\theta^* \in \Theta$ satisfying

$$\Delta_{\theta^*}(Z^n) < \inf_{\eta \in \Theta} \Delta_\eta(Z^n) + \frac{1}{n},$$

where the extra $1/n$ term has been added to ensure that at least one such $\theta^*$ exists. Then $P_{\tilde{\theta}(Z^n)}$ is called the minimum-distance estimate of $P_\theta$. Through an abuse of terminology, we shall also say that $\tilde{\theta}$ is the minimum-distance estimate of $\theta$. The key property of the minimum-distance estimate [16, Thm. 5.13] is that

$$\int_{\mathcal{X}} \left| p_{\tilde{\theta}(Z^n)}(x) - p_\theta(x) \right| dx \le 4\Delta_\theta(Z^n) + \frac{3}{n}. \tag{3.7}$$
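The construction above can be sketched in Python for a deliberately simple case that is not taken from the paper: a unit-variance Gaussian location family whose mean ranges over a finite grid (a hypothetical choice made for illustration). For this family every Yatracos set is a half-line, so the model probabilities come from the normal CDF and the empirical probabilities are sample fractions. Because the grid is finite, the infimum is attained and the extra $1/n$ slack is not needed. All function names below are illustrative.

```python
import math
import random


def normal_cdf(t, mean):
    """CDF of a unit-variance Gaussian with the given mean."""
    return 0.5 * (1.0 + math.erf((t - mean) / math.sqrt(2.0)))


def yatracos_sets(grid):
    """Half-lines A_{theta,eta} = {x : p_theta(x) > p_eta(x)} as (threshold, side)."""
    sets = []
    for theta in grid:
        for eta in grid:
            if theta != eta:
                mid = 0.5 * (theta + eta)
                # x is closer to theta than to eta on the side of mid where theta lies
                sets.append((mid, "left" if theta < eta else "right"))
    return sets


def model_mass(mean, mid, side):
    """P_mean(A) for the half-line A described by (mid, side)."""
    c = normal_cdf(mid, mean)
    return c if side == "left" else 1.0 - c


def empirical_mass(sample, mid, side):
    """Fraction of sample points falling in the half-line (mid, side)."""
    if side == "left":
        hits = sum(1 for z in sample if z < mid)
    else:
        hits = sum(1 for z in sample if z > mid)
    return hits / len(sample)


def minimum_distance_estimate(sample, grid):
    """Return (eta, Delta_eta) minimizing the Yatracos-class discrepancy over the grid."""
    sets = yatracos_sets(grid)
    emp = [empirical_mass(sample, mid, side) for mid, side in sets]
    best, best_delta = None, float("inf")
    for eta in grid:
        delta = max(abs(model_mass(eta, mid, side) - e)
                    for (mid, side), e in zip(sets, emp))
        if delta < best_delta:
            best, best_delta = eta, delta
    return best, best_delta


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.5, 1.0) for _ in range(2000)]
    grid = [i * 0.25 for i in range(-8, 9)]
    print(minimum_distance_estimate(sample, grid))
```

The estimate is selected purely by comparing probabilities of sets, which is why the complexity of the class of sets (its VC dimension) governs the estimation error.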

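Since the variational distance between any two absolutely continuous distributions $P, Q$ on $\mathbb{R}^d$ is equal to one half of the $L_1$ distance between their respective densities $p, q$ [19, Thm. 5.1], i.e.,

$$d_V(P, Q) = \frac{1}{2} \int_{\mathbb{R}^d} |p(x) - q(x)|\, dx,$$

we can rewrite (3.7) as

$$d_V\big(P_\theta, P_{\tilde{\theta}(Z^n)}\big) \le 2\Delta_\theta(Z^n) + \frac{3}{2n}. \tag{3.8}$$

Since $\mathcal{A}_\Theta$ is a Vapnik-Chervonenkis class, Lemma A.2 in the Appendices asserts that

$$E_\theta\left[\Delta_\theta(Z^n)\right] \le c_1 \sqrt{\frac{\log n}{n}}, \tag{3.9}$$

where $c_1$ is a constant that depends only on the VC dimension of $\mathcal{A}_\Theta$. Taking expectations of both sides of (3.8) and applying (3.9), we get

$$E_\theta\left[ d_V\big(P_\theta, P_{\tilde{\theta}(Z^n)}\big) \right] \le 2c_1 \sqrt{\frac{\log n}{n}} + \frac{3}{2n}. \tag{3.10}$$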


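Next, we construct the lossy encoder $\tilde{g}$. Since $\Theta$ is bounded, it is contained in some hypercube $M$ of side $J$, where $J$ is some positive integer. Let $\mathcal{M}^{(n)} = \{M^{(n)}_1, M^{(n)}_2, \dots, M^{(n)}_K\}$ be a partitioning of $M$ into nonoverlapping hypercubes of side $1/\lceil n^{1/2} \rceil$, so that $K \le (J n^{1/2})^k$. Represent each $M^{(n)}_j$ that intersects $\Theta$ by a unique fixed-length binary string $\tilde{s}_j$, and let $\tilde{\mathcal{S}} = \{\tilde{s}_j\}$. Then if a given $\theta \in \Theta$ is contained in $M^{(n)}_j$, map it to $\tilde{s}_j$, $\tilde{g}(\theta) = \tilde{s}_j$; this choice can be described by a string of no more than $k(\log n^{1/2} + \log J)$ bits. Finally, for each $M^{(n)}_j$ that intersects $\Theta$, choose a representative $\hat{\theta}_j \in M^{(n)}_j \cap \Theta$ and define the corresponding $n$-block code $C^n_{\tilde{s}_j}$ to be $C^n_{\hat{\theta}_j}$. Thus, we can associate to $\tilde{g}$ the decoder $\tilde{\psi} : \tilde{\mathcal{S}} \to \Theta$ via $\tilde{\psi}(\tilde{s}_j) = \hat{\theta}_j$.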
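A minimal Python sketch of this parameter-space quantizer follows, under the simplifying assumption (not made in the text) that $\Theta$ is the whole cube $[0, J]^k$, so that cell centers are valid representatives; the text only requires some representative in $M^{(n)}_j \cap \Theta$. The names `quantize_parameter` and `header_bits` are hypothetical. The bit count is computed with explicit rounding, so it matches the $k(\log n^{1/2} + \log J)$ figure only up to rounding.

```python
import math


def quantize_parameter(theta, n, J, origin=None):
    """Map theta in [origin, origin + J]^k to (cell index, cell-center representative)."""
    k = len(theta)
    origin = origin if origin is not None else [0.0] * k
    per_unit = math.ceil(math.sqrt(n))       # cells per unit length
    side = 1.0 / per_unit                    # cell side 1 / ceil(n^{1/2})
    cells_per_side = J * per_unit
    index = []
    for t, o in zip(theta, origin):
        i = int((t - o) / side)
        # clip so that points on the upper boundary fall in the last cell
        index.append(min(max(i, 0), cells_per_side - 1))
    rep = [o + (i + 0.5) * side for i, o in zip(index, origin)]
    return tuple(index), rep


def header_bits(n, J, k):
    """Length of a fixed-length binary description of one cell."""
    return k * math.ceil(math.log2(J * math.ceil(math.sqrt(n))))


if __name__ == "__main__":
    idx, rep = quantize_parameter([0.3141, 0.71], n=100, J=1)
    print(idx, rep, header_bits(100, 1, 2))
```

Since each cell has diameter $\sqrt{k}/\lceil n^{1/2} \rceil$, any point and its representative in the same cell are within $\sqrt{k/n}$ of each other, which is the property used later in the proof.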

Now let us describe and analyze the operation of the resulting two-stage $(n, n)$-block code $C^{n,n}$. In order to keep the notation simple, we shall suppress the discrete time variable $t$ and denote the current block $X^n(t)$ by $X^n$, while the preceding block $X^n(t-1)$ will be denoted by $Z^n$. The first-stage encoder $\tilde{f}$ computes the minimum-distance estimate $\tilde{\theta} = \tilde{\theta}(Z^n)$ and communicates its lossy binary description $\tilde{s} = \tilde{f}(Z^n) = \tilde{g}(\tilde{\theta}(Z^n))$ to the second-stage encoder. The second-stage encoder then encodes $X^n$ with the $n$-block code $C^n_{\tilde{s}} = (f_{\tilde{s}}, \phi_{\tilde{s}})$, which is the code $C^n_{\hat{\theta}}$ matched to $\hat{\theta} = \tilde{\psi}(\tilde{s})$, the quantized version of the minimum-distance estimate $\tilde{\theta}$. The string transmitted to the decoder thus consists of two parts: the header $\tilde{s}$, which specifies the second-stage code $C^n_{\tilde{s}}$, and the body $s = f_{\tilde{s}}(X^n)$, which is the encoding of $X^n$ under $C^n_{\tilde{s}}$. The decoder computes the reproduction $\hat{X}^n = \phi_{\tilde{s}}(s)$. Note, however, that the header $\tilde{s}$ not only instructs the decoder how to decode the body $s$, but also contains a binary description of the quantized minimum-distance estimate of the active source, which can be recovered by means of the rule $\hat{\theta} = \tilde{\psi}(\tilde{s})$.

In order to keep the notation simple, assume for now that $p = 1$, i.e., the distortion function $\rho$ is a metric on $\mathcal{X} \cup \hat{\mathcal{X}}$ with the bound $\rho_{\max}$; the case of general $p$ is similar. The rate of $C^{n,n}$ is clearly no more than

$$R + \frac{k(\log n^{1/2} + \log J)}{n}$$

bits per letter, which proves (3.4). The average per-letter distortion of $C^{n,n}$ on the source $P_\theta$ is, in accordance with (3.2), given by

$$D_\theta(C^{n,n}) = E_\theta\left[ D_\theta\big(C^n_{\tilde{f}(Z^n)}\big) \right],$$

where $C^n_{\tilde{f}(Z^n)} = C^n_{\hat{\theta}}$ with $\hat{\theta} = \tilde{\psi}(\tilde{f}(Z^n))$. Without loss of generality, we may assume that each $C^n_\theta$ is a nearest-neighbor quantizer, i.e.,

$$\rho\big(x^n, C^n_\theta(x^n)\big) = \min_{\hat{x}^n \in \Gamma_\theta} \rho(x^n, \hat{x}^n)$$

for all $x^n \in \mathcal{X}^n$, where $\Gamma_\theta$ is the codebook of $C^n_\theta$. Then we have the following chain of estimates:

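$$
\begin{aligned}
D_\theta(C^n_{\hat{\theta}}) &\overset{(a)}{\le} D_{\hat{\theta}}(C^n_{\hat{\theta}}) + 2\rho_{\max}\, d_V(P_\theta, P_{\hat{\theta}}) \\
&\overset{(b)}{=} \hat{D}^n_{\hat{\theta}}(R) + 2\rho_{\max}\, d_V(P_\theta, P_{\hat{\theta}}) \\
&\overset{(c)}{\le} \hat{D}^n_\theta(R) + 4\rho_{\max}\, d_V(P_\theta, P_{\hat{\theta}}) \\
&\overset{(d)}{\le} \hat{D}^n_\theta(R) + 4\rho_{\max} \left[ d_V(P_\theta, P_{\tilde{\theta}}) + d_V(P_{\tilde{\theta}}, P_{\hat{\theta}}) \right],
\end{aligned}
$$

where (a) and (c) follow from a basic quantizer mismatch estimate (Lemma C.1 in the Appendices), (b) follows from the assumed $n$th-order optimality of $C^n_{\hat{\theta}}$ for $P_{\hat{\theta}}$, while (d) is a routine application of the triangle inequality. Taking expectations, we get

$$D_\theta(C^{n,n}) \le \hat{D}^n_\theta(R) + 4\rho_{\max} \left\{ E_\theta[d_V(P_\theta, P_{\tilde{\theta}})] + E_\theta[d_V(P_{\tilde{\theta}}, P_{\hat{\theta}})] \right\}. \tag{3.11}$$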


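We now estimate separately each term in the curly brackets in (3.11). The first term can be bounded using the fact that $\tilde{\theta} = \tilde{\theta}(Z^n)$ is a minimum-distance estimate of $\theta$, so by (3.10) we have

$$E_\theta[d_V(P_\theta, P_{\tilde{\theta}})] \le 2c_1 \sqrt{\frac{\log n}{n}} + \frac{3}{2n}. \tag{3.12}$$

The second term involves $\tilde{\theta}$ and its quantized version $\hat{\theta}$, which satisfy $\|\tilde{\theta} - \hat{\theta}\| \le \sqrt{k/n}$, by construction of the parameter space quantizer $(\tilde{g}, \tilde{\psi})$. By the uniform local Lipschitz property of the map $\theta \mapsto P_\theta$, there exist constants $r > 0$ and $m > 0$, such that

$$d_V(P_\eta, P_{\hat{\theta}}) \le m \|\eta - \hat{\theta}\|$$

for all $\eta \in B_r(\hat{\theta})$. If $\tilde{\theta} \in B_r(\hat{\theta})$, this implies that $d_V(P_{\tilde{\theta}}, P_{\hat{\theta}}) \le m\sqrt{k/n}$. Suppose, on the other hand, that $\tilde{\theta} \notin B_r(\hat{\theta})$. Then $\|\tilde{\theta} - \hat{\theta}\| \ge r$. Therefore, since $d_V(\cdot, \cdot)$ is bounded from above by unity, we can write

$$d_V(P_{\tilde{\theta}}, P_{\hat{\theta}}) \le 1 \le \frac{1}{r} \|\tilde{\theta} - \hat{\theta}\| \le \frac{1}{r} \sqrt{\frac{k}{n}}.$$

Let $b = \max(m, 1/r)$. Then the above argument implies that

$$d_V(P_{\tilde{\theta}}, P_{\hat{\theta}}) \le b \sqrt{\frac{k}{n}} \tag{3.13}$$

and consequently

$$E_\theta[d_V(P_{\tilde{\theta}}, P_{\hat{\theta}})] \le b \sqrt{\frac{k}{n}}. \tag{3.14}$$

Substituting the bounds (3.12) and (3.14) into (3.11) yields

$$D_\theta(C^{n,n}) \le \hat{D}^n_\theta(R) + \rho_{\max} \left( 8c_1 \sqrt{\frac{\log n}{n}} + \frac{6}{n} + 4b \sqrt{\frac{k}{n}} \right),$$

whence it follows that the $n$th-order redundancy $\delta^n_\theta(C^{n,n}) = O(\sqrt{n^{-1} \log n})$ for every $\theta \in \Theta$. Then the decomposition

$$\delta_\theta(C^{n,n}) = \delta^n_\theta(C^{n,n}) + \hat{D}^n_\theta(R) - D_\theta(R)$$

and Proposition 3.1 imply that (3.5) holds for every $\theta \in \Theta$. The case of $p \neq 1$ is similar.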
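To get a feel for how the two overhead terms scale, the following Python sketch evaluates the per-letter header overhead $k(\log n^{1/2} + \log J)/n$ and an $n$th-order redundancy bound of the form $\rho_{\max}(8c_1\sqrt{\log n / n} + 6/n + 4b\sqrt{k/n})$. The constants $c_1$, $b$ and $\rho_{\max}$ are not specified numerically in the text, so the values used in the usage example are purely hypothetical, as is the choice of the natural logarithm in the redundancy term.

```python
import math


def overhead_rate(n, k, J):
    """Header overhead in bits per letter: k (log2 n^{1/2} + log2 J) / n."""
    return k * (0.5 * math.log2(n) + math.log2(J)) / n


def redundancy_bound(n, k, c1, b, rho_max):
    """rho_max * (8 c1 sqrt(log n / n) + 6/n + 4 b sqrt(k/n)); constants are placeholders."""
    return rho_max * (8.0 * c1 * math.sqrt(math.log(n) / n)
                      + 6.0 / n
                      + 4.0 * b * math.sqrt(k / n))


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        print(n, overhead_rate(n, k=2, J=1),
              redundancy_bound(n, k=2, c1=1.0, b=1.0, rho_max=1.0))
```

The rate overhead decays like $\log n / n$, while the redundancy bound is dominated by the $\sqrt{\log n / n}$ estimation term, in line with (3.4) and (3.5).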

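To prove (3.6), fix an $\epsilon > 0$ and note that by (3.8), (3.13) and the triangle inequality, $d_V(P_\theta, P_{\hat{\theta}}) > \epsilon$ implies that

$$2\Delta_\theta(Z^n) + \frac{3}{2n} + b\sqrt{\frac{k}{n}} > \epsilon.$$

Hence,

$$P\left\{ d_V(P_\theta, P_{\hat{\theta}}) > \epsilon \right\} \le P\left\{ \Delta_\theta(Z^n) > \frac{1}{2}\left( \epsilon - \frac{c_2}{\sqrt{n}} \right) \right\},$$

where $c_2 = 3/2 + b\sqrt{k}$. Therefore, by Lemma A.2,

$$P\left\{ d_V(P_\theta, P_{\hat{\theta}}) > \epsilon \right\} \le 8 n^V e^{-n(\epsilon - c_2/\sqrt{n})^2/128}. \tag{3.15}$$

If for each $n$ we choose $\epsilon_n > \sqrt{128 V \ln n / n} + c_2 \sqrt{1/n}$, then the right-hand side of (3.15) will be summable in $n$, hence $d_V(P_\theta, P_{\hat{\theta}}) = O(\sqrt{n^{-1} \log n})$ a.s. by the Borel-Cantelli lemma. ■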

Remark 3.3. Our proof combines the techniques of Rissanen [1], in that the second-stage code is selected through explicit estimation of the source parameters, and of Chou, Effros and Gray [13], in that the parameter space is quantized and each $\theta$ is identified with its optimal code $C^n_\theta$. The novel element here is the use of minimum-distance estimation instead of maximum-likelihood estimation, which is responsible for the appearance of the VC dimension.

Remark 3.4. The boundedness of the distortion measure has been assumed mostly in order to ensure that the main idea behind the proof is not obscured by technical details. In Section IV-D we present an extension to distortion measures that satisfy a moment condition with respect to a fixed reference letter in the reproduction alphabet. In that case, the parameter estimation fidelity and the per-letter overhead rate still converge to zero as $O(\sqrt{n^{-1} \log n})$ and $O(n^{-1} \log n)$, respectively, but the distortion redundancy converges more slowly as $O(\sqrt[4]{n^{-1} \log n})$.

Remark 3.5. Essentially the same convergence rates, up to multiplicative and/or additive constants, can be obtained if the memory length is taken to be some fraction of the block length $n$: $m = \alpha n$ for some $\alpha \in (0, 1)$.

Remark 3.6. Let us compare the local Lipschitz condition of Theorem 3.2 to the corresponding smoothness conditions of Rissanen [1] for lossless codes and of Chou et al. [13] for quantizers. In the lossless case, $\mathcal{X}$ is finite or countably infinite, and the smoothness condition is for the relative entropies $D(P_\eta \| P_\theta) = \sum_{x \in \mathcal{X}} p_\eta(x) \log\big(p_\eta(x)/p_\theta(x)\big)$, where $p_\theta$ and $p_\eta$ are the corresponding probability mass functions, to be locally quadratic in $\theta$: $D(P_\eta \| P_\theta) \le m_\theta \|\theta - \eta\|^2$ for some constant $m_\theta$ and for all $\eta$ in some open neighborhood of $\theta$. Pinsker's inequality $D(P_\eta \| P_\theta) \ge d_V^2(P_\eta, P_\theta)/2\ln 2$ [25, p. 58] then implies the local Lipschitz property for $d_V(P_\theta, P_\eta)$, although the magnitude of the Lipschitz constant is not uniform in $\theta$. Now, $D(P_\eta \| P_\theta)$ is also the redundancy of the optimum lossless code for $P_\theta$ relative to $P_\eta$. Thus, Rissanen's smoothness condition can be interpreted either in the context of source models or in the context of coding schemes and their redundancies. The latter interpretation has been extended to quantizers in [13], where it was required that the redundancies of the codes $C^n_\theta$ be locally quadratic in $\theta$. However, because here we are interested in joint modeling and coding, we impose a smoothness condition on the source distributions, rather than on the codes. The variational distance is more appropriate here than the relative entropy because, for bounded distortion functions, it is a natural measure of redundancy for lossy codes [9].

Remark 3.7. The Vapnik-Chervonenkis dimension of a given class of measurable subsets of $\mathbb{R}^d$ (provided it is finite) is, in a sense, a logarithmic measure of the combinatorial "richness" of the class for the purposes of learning from empirical data. For many parametric families of probability densities, the VC dimension of the corresponding Yatracos class is polynomial in $k$, the dimension of the parameter space (see [16] for detailed examples).

Remark 3.8. Instead of the Vapnik-Chervonenkis condition, we could have required that the class of sources $\{P_\theta : \theta \in \Theta\}$ be totally bounded with respect to the variational distance. (Totally bounded classes, with respect to either the variational distance or its generalizations, such as the $\bar{\rho}$-distance [26], have, in fact, been extensively used in the theory of universal lossy codes [9].) This was precisely the assumption made in the paper of Yatracos [24] on density estimation, which in turn inspired the work of Devroye and Lugosi [17], [18]. The main result of Yatracos is that, if the class $\{P_\theta : \theta \in \Theta\}$ is totally bounded under the variational distance, then for any $\epsilon > 0$ there exists an estimator $\theta^* = \theta^*(X^n)$, where $X^n$ is an i.i.d. sample from one of the $P_\theta$'s, such that the expected variational distance $E_\theta[d_V(P_\theta, P_{\theta^*(X^n)})]$ is at most $3\epsilon$ plus a term determined by $n$ and by the metric entropy, or Kolmogorov $\epsilon$-entropy [27], of $\{P_\theta : \theta \in \Theta\}$, i.e., the logarithm of the cardinality of the minimal $\epsilon$-net for $\{P_\theta\}$ under $d_V(\cdot, \cdot)$. Thus, if we choose $\epsilon = \epsilon_n$ such that this bound tends to zero as $n \to \infty$, then $\theta^*(X^n)$ is a consistent estimator of $\theta$. However, totally bounded classes have certain drawbacks. For example, depending on the structure and the complexity of the class, the Kolmogorov $\epsilon$-entropy may vary rather drastically from a polynomial in $\log(1/\epsilon)$ for "small" parametric families (e.g., finite mixture families) to a polynomial in $1/\epsilon$ for nonparametric families (e.g., monotone densities on the hypercube or smoothness classes such as Sobolev spaces). One can even construct extreme examples of nonparametric families with metric entropy exponential in $1/\epsilon$. (For details, the reader is invited to consult Ch. 7 of [19].) Thus, in sharp contrast to VC classes for which we can obtain $O(\sqrt{n^{-1} \log n})$ convergence rates both for parameter estimates and for distortion redundancies, the performance of joint universal coding and modeling schemes for totally bounded classes of sources will depend rather strongly on the metric properties of the class. Additionally, although in the totally bounded case there is no need for quantizing the parameter space, one has to construct an $\epsilon$-net for each given class, which is often an intractable problem.


IV. Comparisons and extensions 
A. Comparison with nearest-neighbor and omniscient first-stage encoders 

The two-stage universal quantizer of Chou, Effros and Gray [13] has zero memory and works as follows. Given a collection $\{C^n_{\tilde{s}}\}$ of $n$-block codes, the first-stage encoder is given by the "nearest-neighbor" map

$$\tilde{f}_*(x^n) = \arg\min_{\tilde{s} \in \tilde{\mathcal{S}}} \rho\big(x^n, C^n_{\tilde{s}}(x^n)\big),$$

where the term "nearest-neighbor" is used in the sense that the code $C^n_{\tilde{f}_*(x^n)}$ encodes $x^n$ with the smallest instantaneous distortion among all $C^n_{\tilde{s}}$'s. Accordingly, the average per-letter distortion of the resulting two-stage code $C^n_*$ on the source $P_\theta$ is given by

$$D_\theta(C^n_*) = \frac{1}{n} \int_{\mathcal{X}^n} \min_{\tilde{s} \in \tilde{\mathcal{S}}} \rho\big(x^n, C^n_{\tilde{s}}(x^n)\big)\, dP_\theta(x^n).$$

Although such a code is easily implemented in practice, its theoretical analysis is quite complicated. However, the performance of $C^n_*$ can be upper-bounded if the nearest-neighbor first-stage encoder is replaced by the so-called omniscient first-stage encoder, which has direct access to the source parameter $\theta \in \Theta$, rather than to $x^n$. This latter encoder is obviously not achievable in practice, but is easily seen to do no better than the nearest-neighbor one.
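The nearest-neighbor first-stage map can be illustrated with a small Python sketch. As a simplifying assumption that is not part of [13] or of the present paper, each candidate code is a scalar codebook applied letter by letter under squared-error distortion; the function names are hypothetical. The first stage simply tries every code on the block and returns the index of the one with the smallest instantaneous distortion.

```python
def encode_block(block, codebook):
    """Letterwise nearest-neighbor encoding; returns (reproduction, total squared error)."""
    repro = [min(codebook, key=lambda c: (x - c) ** 2) for x in block]
    dist = sum((x - y) ** 2 for x, y in zip(block, repro))
    return repro, dist


def nearest_neighbor_first_stage(block, codebooks):
    """Index of the code achieving the smallest instantaneous distortion on this block."""
    dists = [encode_block(block, cb)[1] for cb in codebooks]
    return min(range(len(codebooks)), key=dists.__getitem__)


if __name__ == "__main__":
    codebooks = [[-1.0, 1.0], [0.0, 4.0]]
    block = [0.1, 3.9, 0.2, 4.2]
    s = nearest_neighbor_first_stage(block, codebooks)
    print(s, encode_block(block, codebooks[s]))
```

The selected index depends only on the current block, which is why this scheme needs no memory but also carries no guarantee that the chosen code reflects the statistics of the active source.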

This approach can be straightforwardly adapted to the setting of our Theorem 3.2, except that we no longer require Condition 3). In that case, it is apparent that the sequence $\{C^n_*\}$ of the two-stage $n$-block (zero-memory) codes with nearest-neighbor (or omniscient) first-stage encoders is such that

$$R(C^n_*) = R + O\left(\frac{\log n}{n}\right) \tag{4.16}$$

and

$$\delta_\theta(C^n_*) = O\left(\sqrt{\frac{\log n}{n}}\right). \tag{4.17}$$

Comparing (4.16) and (4.17) with (3.4) and (3.5), we immediately see that the use of memory and direct parameter estimation has no effect on rate or on distortion. However, our scheme uses the $O(\log n)$ overhead bits in a more efficient manner: the bits produced by the nearest-neighbor first-stage encoder merely tell the second-stage encoder and the decoder which quantizer to use, but there is, in general, no guarantee that the nearest-neighbor code for a given $x^n \in \mathcal{X}^n$ will be matched to the actual source in an average sense. By contrast, the first-stage description under our scheme, while requiring essentially the same number of extra bits, can be used to identify the active source up to a variational ball of radius $O(\sqrt{n^{-1} \log n})$, with probability arbitrarily close to one.


B. Comparison with schemes based on codebook transmission 

Another two-stage scheme, due to Linder, Lugosi and Zeger [10], [11], yields weakly minimax universal codes for all real i.i.d. sources with bounded support, with respect to the squared-error distortion. The main feature of their approach is that, instead of constraining the first-stage encoder to choose from a collection of preselected codes, they encode each $n$-block $x^n \in \mathcal{X}^n$ by designing, in real time, an optimal quantizer for the empirical distribution $P_{x^n}$, whose codevectors are then quantized to some carefully chosen resolution. Then, in the second stage, $x^n$ is quantized with this "quantized quantizer," and a binary description of the quantized codevectors is transmitted together with the second-stage description of $x^n$. The overhead needed to transmit the quantized codewords is $O(n^{-1} \log n)$ bits per letter, while the distortion redundancy converges to zero at a rate $O(\sqrt{n^{-1} \log n})$.

In order to draw a comparison with the results presented here, let $\{P_\theta : \theta \in \Theta\}$ be a class of real i.i.d. sources satisfying Conditions 1)-3) of Theorem 3.2, and with support contained in some closed interval $[-B, B]$, i.e., $P_\theta\{|X| \le B\} = 1$ for all $\theta \in \Theta$. Let also $\hat{\mathcal{X}} = \mathbb{R}$, and consider the squared-error distortion $\rho(x, \hat{x}) = |x - \hat{x}|^2$. Without loss of generality, we may assume that the optimal $n$-block quantizers $C^n_\theta$ have nearest-neighbor encoders, which in turn allows us to limit our consideration only to those quantizers whose codevectors have all their components in $[-B, B]$. Then $\rho$ is bounded with $\rho_{\max} = 4B^2$, and Theorem 3.2 guarantees the existence of a weakly minimax universal sequence $\{C^{n,n}\}$ of $(n, n)$-block codes satisfying (3.4) and (3.5). Comparing this with the results of Linder et al. quoted in the preceding paragraph, we see that, as far as the rate and the distortion redundancy go, our scheme performs as well as that of [10], [11], but, again, in our case the extra $O(\log n)$ bits have been utilized more efficiently, enabling the decoder to identify the active source with good precision. However, the big difference between our code and that of Linder et al. is that the class of sources considered by them is fully nonparametric, whereas our development requires that the sources belong to a compactly parametrized family.


C. Extension to curved parametric families 

We can also consider parameter spaces that are more general than bounded subsets of $\mathbb{R}^k$. For instance, in information geometry [28] one often encounters curved parametric families, i.e., families $\{P_\theta : \theta \in \Theta\}$ of probability distributions where the parameter space $\Theta$ is a smooth compact manifold. Roughly speaking, an abstract set $\Theta$ is a smooth compact manifold of dimension $k$ if it admits a covering by finitely many sets $G_l \subset \Theta$, such that for each $l$ there exists a one-to-one map $\xi_l$ of $G_l$ onto a precompact subset $F_l$ of $\mathbb{R}^k$; the maps $\xi_l$ are also required to satisfy a certain smooth compatibility condition, but we need not consider it here. The pairs $(G_l, \xi_l)$ are called the charts of $\Theta$.

In order to cover this case, we need to make the following modifications in the statement and in the proof of Theorem 3.2. First of all, let $\{P_\theta : \theta \in \Theta\}$ satisfy Condition 3) of the theorem, and replace Condition 2) with

2a) For each $l$, the map $u \mapsto P_{\xi_l^{-1}(u)}$, $u \in F_l$, is uniformly locally Lipschitz: there exist constants $r_l > 0$ and $m_l > 0$, such that for every $u \in F_l$,

$$d_V\big(P_{\xi_l^{-1}(u)}, P_{\xi_l^{-1}(w)}\big) \le m_l \|u - w\|$$

for all $w \in B_{r_l}(u)$.

[Note that $\xi_l^{-1}(u) \in G_l \subset \Theta$ for all $u \in F_l$.] Condition 1) is satisfied for each $F_l$ by definition of $\Theta$. Next, we need to modify the first-stage encoder. For each $l$, quantize $F_l$ in cubes of side $1/\lceil n^{1/2} \rceil$, so that each $u \in F_l$ can be encoded into $k(\log n^{1/2} + \log J_l)$ bits, for some $J_l$, and reproduced by some $\hat{u} \in F_l$ satisfying $\|u - \hat{u}\| \le \sqrt{k/n}$. Then $\tilde{\theta} = \xi_l^{-1}(u)$ and $\hat{\theta} = \xi_l^{-1}(\hat{u})$ both lie in $G_l \subset \Theta$. Now, when the first-stage encoder computes the minimum-distance estimate $\tilde{\theta}$ of the active source $\theta$, it will prepend a fixed-length binary description of the index $l$ such that $\tilde{\theta} \in G_l$ to the binary description of the cube in $\mathbb{R}^k$ containing $u = \xi_l(\tilde{\theta})$. Let $\hat{u}$ be the reproduction of $u$ under the cubic quantizer for $F_l$. The per-letter rate of the resulting two-stage code is

$$R(C^{n,n}) = R + O\left(\frac{\log n}{n}\right)$$

bits per letter. The $n$th-order distortion redundancy is bounded as

$$\delta^n_\theta(C^{n,n}) \le 4\rho_{\max} \left\{ E_\theta[d_V(P_\theta, P_{\tilde{\theta}})] + E_\theta[d_V(P_{\tilde{\theta}}, P_{\hat{\theta}})] \right\},$$

where $\hat{\theta} = \xi_l^{-1}(\hat{u})$. The first term in the brackets is upper-bounded by means of the usual Vapnik-Chervonenkis estimate,

$$E_\theta[d_V(P_\theta, P_{\tilde{\theta}})] \le 2c_1 \sqrt{\frac{\log n}{n}} + \frac{3}{2n},$$

while the second term is handled using Condition 2a). Specifically, if $\tilde{\theta} \in G_l$, then $\tilde{\theta} = \xi_l^{-1}(u)$ and $\hat{\theta} = \xi_l^{-1}(\hat{u})$. Then the same argument as in the proof of Theorem 3.2 can be used to show that there exists a constant $b_l > 0$ such that $d_V(P_{\tilde{\theta}}, P_{\hat{\theta}}) \le b_l \sqrt{k/n}$, which can be further bounded by $b\sqrt{k/n}$ with $b = \max_l b_l$. Combining all these bounds and using Proposition 3.1, we get that the distortion redundancy is

$$\delta_\theta(C^{n,n}) = O\left(\sqrt{\frac{\log n}{n}}\right).$$

This establishes that $\{C^{n,n}\}$ is weakly minimax universal for the curved parametric family $\{P_\theta : \theta \in \Theta\}$. The fidelity of the source identification procedure is similar to that in the "flat" case $\Theta \subset \mathbb{R}^k$, by the same Borel-Cantelli arguments as in the proof of Theorem 3.2.


D. Extension to unbounded distortion measures 

In this section we show that the boundedness condition on the distortion measure can be relaxed, so that our approach can work with any distortion measure satisfying a certain moment condition, except that the distortion redundancy will converge to zero at a slower rate of $O(\sqrt[4]{n^{-1} \log n})$ instead of $O(\sqrt{n^{-1} \log n})$, as in the bounded case.

Specifically, let $\{P_\theta : \theta \in \Theta\}$ be a family of i.i.d. sources satisfying the conditions of Theorem 3.2, and let $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ be a single-letter distortion function for which there exists a reference letter $a_* \in \hat{\mathcal{X}}$ such that

$$\int_{\mathcal{X}} \rho^2(x, a_*)\, dP_\theta(x) \le G < \infty \tag{4.18}$$

for all $\theta \in \Theta$, and which has the form $\rho(x, \hat{x}) = [d(x, \hat{x})]^p$ for some $p > 0$, where $d(\cdot, \cdot)$ is a metric on $\mathcal{X} \cup \hat{\mathcal{X}}$. In the following, we shall show that for any rate $R > 0$ satisfying

$$D(R, \Theta) = \sup_{\theta \in \Theta} D_\theta(R) < \infty,$$

and for any $\epsilon > 0$ there exists a sequence $\{\check{C}^{n,n}\}$ of two-stage $(n, n)$-block codes, such that

$$R(\check{C}^{n,n}) \le R + \epsilon + O\left(\frac{\log n}{n}\right) \tag{4.19}$$

and

$$\delta_\theta(\check{C}^{n,n}) \le \epsilon + O\left(\sqrt[4]{\frac{\log n}{n}}\right) \tag{4.20}$$

for every $\theta \in \Theta$. Taking a cue from Garcia-Munoz and Neuhoff [29], we shall call a sequence of codes $\{C^{n,n}\}$ satisfying

$$\limsup_{n \to \infty} R(C^{n,n}) \le R + \epsilon$$

and

$$\limsup_{n \to \infty} \delta_\theta(C^{n,n}) \le \epsilon, \quad \forall \theta \in \Theta$$

for a given $\epsilon > 0$ $\epsilon$-weakly minimax universal for $\{P_\theta : \theta \in \Theta\}$. By continuity, the existence of $\epsilon$-weakly minimax universal codes for all $\epsilon > 0$ then implies the existence of weakly minimax universal codes in the sense of Definition 2.2. Moreover, we shall show that the convergence rate of the source identification procedure is the same as in the case of a bounded distortion function, namely $O(\sqrt{n^{-1} \log n})$; in particular, the constant implicit in the $O(\cdot)$ notation depends neither on $\epsilon$ nor on the behavior of $\rho$.

The proof below draws upon some ideas of Dobrushin [30], the difference being that he considered robust, rather than universal, codes.$^3$ Let $M > 0$ be a constant to be specified later, and define a single-letter distortion function $\rho_M : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ by

$$\rho_M(x, \hat{x}) = \begin{cases} \rho(x, \hat{x}), & \text{if } \rho(x, \hat{x}) \le M \\ M, & \text{if } \rho(x, \hat{x}) > M. \end{cases}$$

Let $\bar{D}_\theta(C^{n,m})$ denote the average per-letter $\rho_M$-distortion of an $(n, m)$-block code $C^{n,m}$ with respect to $P_\theta$, and let $\bar{D}_\theta(R)$ denote the corresponding Shannon DRF. Then Theorem 3.2 guarantees that for every $R > 0$ there exists a weakly minimax universal sequence $\{C^{n,n}\}$ of two-stage $(n, n)$-block codes, such that

$$R(C^{n,n}) = R + O\left(\frac{\log n}{n}\right) \tag{4.21}$$

and

$$\bar{D}_\theta(C^{n,n}) = \bar{D}_\theta(R) + O\left(\sqrt{\frac{\log n}{n}}\right) \tag{4.22}$$

for all $\theta \in \Theta$.


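We shall now modify $C^{n,n}$ to obtain a new code $\check{C}^{n,n}$. Fix some $\delta > 0$, to be chosen later. Let $\{C^n_{\tilde{s}} : \tilde{s} \in \tilde{\mathcal{S}}\}$ be the collection of the second-stage codes of $C^{n,n}$. Fix $\tilde{s} \in \tilde{\mathcal{S}}$ and let $\Gamma_{\tilde{s}}$ be the reproduction codebook of $C^n_{\tilde{s}}$. Let $\check{\Gamma}_{\tilde{s}} \subset \hat{\mathcal{X}}^n$ be the set consisting of (a) all codevectors in $\Gamma_{\tilde{s}}$, (b) all vectors obtained by replacing $\lfloor \delta n \rfloor$ or fewer components of each codevector in $\Gamma_{\tilde{s}}$ with $a_*$, and (c) the vector $a_*^n$. The size of $\check{\Gamma}_{\tilde{s}}$ can be estimated by means of Stirling's formula as

$$|\check{\Gamma}_{\tilde{s}}| \le |\Gamma_{\tilde{s}}| \sum_{j=0}^{\lfloor \delta n \rfloor} \binom{n}{j} + 1 \le |\Gamma_{\tilde{s}}|\, 2^{n[h(\delta) + o(1)]} + 1,$$

where $h(\delta) = -\delta \log \delta - (1 - \delta) \log(1 - \delta)$ is the binary entropy function. Since $|\Gamma_{\tilde{s}}| = 2^{nR}$, we can choose $\delta$ small enough so that

$$|\check{\Gamma}_{\tilde{s}}| \le 2^{n(R + \epsilon)}. \tag{4.23}$$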
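The counting step can be checked numerically. The Python sketch below (function names are illustrative, not from the paper) counts the substitution patterns $\sum_{j \le \lfloor \delta n \rfloor} \binom{n}{j}$ exactly and compares the resulting per-letter rate increase with the binary entropy $h(\delta)$; for $\delta \le 1/2$ the standard bound $\sum_{j \le \delta n} \binom{n}{j} \le 2^{n h(\delta)}$ guarantees that the former never exceeds the latter.

```python
import math


def binary_entropy(delta):
    """Binary entropy h(delta) in bits."""
    if delta <= 0.0 or delta >= 1.0:
        return 0.0
    return -delta * math.log2(delta) - (1.0 - delta) * math.log2(1.0 - delta)


def substitution_patterns(n, delta):
    """Number of ways to choose at most floor(delta * n) positions out of n."""
    return sum(math.comb(n, j) for j in range(math.floor(delta * n) + 1))


def rate_increase(n, delta):
    """Extra bits per letter needed to index the substitution patterns."""
    return math.log2(substitution_patterns(n, delta)) / n


if __name__ == "__main__":
    for n, delta in ((100, 0.1), (1000, 0.01)):
        print(n, delta, rate_increase(n, delta), binary_entropy(delta))
```

Choosing $\delta$ so that $h(\delta) < \epsilon$ therefore keeps the enlarged codebook within the rate budget $R + \epsilon$ for all sufficiently large $n$.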


$^3$A sequence of lossy codes is (strongly) robust for a given class of information sources at rate $R$ (see, e.g., [31]-[33]) if its asymptotic performance on each source in the class is no worse than the supremum of the distortion-rate functions of all the sources in the class at $R$. Neuhoff and Garcia-Munoz [33] have shown that strongly robust codes occur more widely than strongly minimax universal codes, but less widely than weakly minimax universal ones.

Now, if $C^n_{\tilde{s}}$ maps a given $x^n \in \mathcal{X}^n$ to $\hat{x}^n = (\hat{x}_1, \dots, \hat{x}_n) \in \hat{\mathcal{X}}^n$, define a new string $\check{x}^n = (\check{x}_1, \dots, \check{x}_n) \in \hat{\mathcal{X}}^n$ as follows. If $|\{1 \le i \le n : \rho(x_i, \hat{x}_i) > M\}| \le \delta n$, let

$$\check{x}_i = \begin{cases} \hat{x}_i, & \text{if } \rho(x_i, \hat{x}_i) \le M \\ a_*, & \text{if } \rho(x_i, \hat{x}_i) > M; \end{cases}$$

otherwise, let $\check{x}^n = a_*^n$. Now, construct a new code $\check{C}^n_{\tilde{s}}$ with the codebook $\check{\Gamma}_{\tilde{s}}$ and with the encoder and the decoder defined in such a way that $\check{C}^n_{\tilde{s}}(x^n) = \check{x}^n$ whenever $C^n_{\tilde{s}}(x^n) = \hat{x}^n$. Finally, let $\check{C}^{n,n}$ be a two-stage code with the same first-stage encoder as $C^{n,n}$, but with the collection of the second-stage codes replaced by $\{\check{C}^n_{\tilde{s}}\}$. From (4.23) it follows that $R(\check{C}^{n,n}) \le R(C^{n,n}) + \epsilon$. Since $R(C^{n,n}) = R + O(n^{-1} \log n)$, we have that

$$R(\check{C}^{n,n}) \le R + \epsilon + O\left(\frac{\log n}{n}\right). \tag{4.24}$$
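The codeword modification rule is easy to state in code. The following Python sketch applies it to a single block; the function name, the squared-error distortion and the numerical values in the usage example are hypothetical choices for illustration. Letters whose distortion exceeds $M$ are replaced by the reference letter, unless there are more than $\delta n$ of them, in which case the whole block is replaced by the all-reference string.

```python
def modify_reproduction(x, x_hat, rho, M, delta, a_star):
    """Replace high-distortion reproduction letters by a_star, or the whole block if too many."""
    n = len(x)
    bad = [rho(xi, yi) > M for xi, yi in zip(x, x_hat)]
    if sum(bad) <= delta * n:
        # few outliers: patch only the offending positions
        return [a_star if is_bad else yi for is_bad, yi in zip(bad, x_hat)]
    # too many outliers: fall back to the all-reference-letter vector
    return [a_star] * n


if __name__ == "__main__":
    sq = lambda a, b: (a - b) ** 2
    x = [0.0, 0.0, 10.0, 0.0]
    x_hat = [0.1, 0.0, 0.0, 0.2]
    print(modify_reproduction(x, x_hat, sq, M=1.0, delta=0.25, a_star=5.0))
    print(modify_reproduction(x, x_hat, sq, M=1.0, delta=0.10, a_star=5.0))
```

Every output of this rule is either an original codevector with at most $\lfloor \delta n \rfloor$ letters replaced or the all-reference vector, so it lies in the enlarged codebook described above.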

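Furthermore, the code $\check{C}^{n,n}$ has the following property:

Lemma 4.1. Let $G' = G(1 + 2/\delta)$. Then for any $\theta \in \Theta$,

$$D_\theta(\check{C}^{n,n}) \le \bar{D}_\theta(C^{n,n}) + \sqrt{\frac{G'}{M} \bar{D}_\theta(C^{n,n})}. \tag{4.25}$$

Proof: See Appendix D.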
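Substituting (4.22) into (4.25), we have that

$$D_\theta(\check{C}^{n,n}) \le \bar{D}_\theta(R) + O\left(\sqrt{\frac{\log n}{n}}\right) + \sqrt{\frac{G'}{M}\left[ \bar{D}_\theta(R) + O\left(\sqrt{\frac{\log n}{n}}\right) \right]}. \tag{4.26}$$

Now, since $\rho_M(x, \hat{x}) \le \rho(x, \hat{x})$ for all $(x, \hat{x}) \in \mathcal{X} \times \hat{\mathcal{X}}$, $\bar{D}_\theta(R) \le D_\theta(R)$ for all $\theta \in \Theta$. Using this fact and the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, we can write

$$D_\theta(\check{C}^{n,n}) \le D_\theta(R) + O\left(\sqrt{\frac{\log n}{n}}\right) + \sqrt{\frac{G'}{M} D_\theta(R)} + O\left(\sqrt[4]{\frac{\log n}{n}}\right).$$

Upon choosing $M$ so that $\sqrt{G' D(R, \Theta)/M} < \epsilon$, we get

$$\delta_\theta(\check{C}^{n,n}) \le \epsilon + O\left(\sqrt[4]{\frac{\log n}{n}}\right). \tag{4.27}$$

Thus, (4.24) and (4.27) prove the claim made at the beginning of the section. Moreover, because the first-stage encoder of $\check{C}^{n,n}$ is the same as in $C^{n,n}$, our code modification procedure has no effect on parameter estimation, so the same arguments as in the end of the proof of Theorem 3.2 can be used to show that the decoder can identify the source in effect up to a variational ball of radius $O(\sqrt{n^{-1} \log n})$ asymptotically almost surely.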


V. Examples 

In this section we present a detailed analysis of two classes of parametric sources that meet Conditions 1)-3) of Theorem 3.2, and thus admit schemes for joint universal lossy coding and modeling. These are finite mixture classes and exponential families, which are widely used in statistical modeling, both in theory and in practice (see, e.g., [34]-[37]).

A. Mixture classes

Let $p_1, \dots, p_k$ be fixed probability densities on a measurable $\mathcal{X} \subseteq \mathbb{R}^d$, and let

$$\Theta = \left\{ \theta = (\theta_1, \dots, \theta_k) \in \mathbb{R}^k : 0 \le \theta_i \le 1,\ 1 \le i \le k;\ \sum_{i=1}^{k} \theta_i = 1 \right\}$$

be the probability $k$-simplex. Then the mixture class defined by the $p_i$'s consists of all densities of the form

$$p_\theta(x) = \sum_{i=1}^{k} \theta_i p_i(x).$$

The parameter space $\Theta$ is obviously compact, which establishes Condition 1) of Theorem 3.2. In order to show that Condition 2) holds, fix any $\theta, \eta \in \Theta$. Then

$$
\begin{aligned}
d_V(P_\theta, P_\eta) &= \frac{1}{2} \int_{\mathcal{X}} |p_\theta(x) - p_\eta(x)|\, dx \\
&\le \frac{1}{2} \sum_{i=1}^{k} \int_{\mathcal{X}} |\theta_i - \eta_i|\, p_i(x)\, dx \\
&= \frac{1}{2} \sum_{i=1}^{k} |\theta_i - \eta_i| \\
&\le \frac{\sqrt{k}}{2} \sqrt{\sum_{i=1}^{k} (\theta_i - \eta_i)^2},
\end{aligned}
$$

where the last inequality is a consequence of the concavity of the square root. This implies that the map $\theta \mapsto P_\theta$ is everywhere Lipschitz with Lipschitz constant $\sqrt{k}/2$. We have left to show that Condition 3) of Theorem 3.2 holds as well, i.e., that the Yatracos class

$$\mathcal{A}_\Theta = \big\{ A_{\theta,\eta} = \{x \in \mathcal{X} : p_\theta(x) > p_\eta(x)\} : \theta, \eta \in \Theta;\ \theta \neq \eta \big\}$$

has finite Vapnik-Chervonenkis dimension. To this end, observe that $x \in A_{\theta,\eta}$ if and only if

$$\sum_{i=1}^{k} (\theta_i - \eta_i)\, p_i(x) > 0.$$

Thus, $\mathcal{A}_\Theta$ consists of sets of the form

$$\left\{ x \in \mathcal{X} : \sum_{i=1}^{k} \alpha_i p_i(x) > 0 \right\}, \quad \alpha = (\alpha_1, \dots, \alpha_k) \in \mathbb{R}^k.$$

Since the functions $p_1, \dots, p_k$ span a linear space whose dimension is not larger than $k$, Lemma A.3 in the Appendices guarantees that $V(\mathcal{A}_\Theta) \le k$, which establishes Condition 3).
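The Lipschitz estimate for mixture classes can be verified numerically. In the Python sketch below, the component densities are piecewise-constant (histogram) densities on $[0, 1]$, a hypothetical choice that makes the $L_1$ integral exact; all names are illustrative. The sketch computes $d_V(P_\theta, P_\eta)$ and compares it with the bound $(\sqrt{k}/2)\|\theta - \eta\|$.

```python
import math
import random


def mixture(theta, components):
    """Bin heights of the mixture density sum_i theta_i p_i for histogram components."""
    bins = len(components[0])
    return [sum(t * p[b] for t, p in zip(theta, components)) for b in range(bins)]


def variational_distance(p, q, width):
    """Half the L1 distance between two histogram densities with equal bin width."""
    return 0.5 * width * sum(abs(a - b) for a, b in zip(p, q))


def lipschitz_bound(theta, eta):
    """(sqrt(k)/2) * Euclidean distance between the mixing weights."""
    k = len(theta)
    return 0.5 * math.sqrt(k) * math.sqrt(sum((a - b) ** 2 for a, b in zip(theta, eta)))


def random_simplex_point(k, rng):
    """A point of the probability simplex obtained by normalizing uniform draws."""
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]


if __name__ == "__main__":
    # three histogram densities on [0, 1] with four bins of width 0.25
    comps = [[4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 2.0, 2.0]]
    rng = random.Random(1)
    theta, eta = random_simplex_point(3, rng), random_simplex_point(3, rng)
    dv = variational_distance(mixture(theta, comps), mixture(eta, comps), 0.25)
    print(dv, lipschitz_bound(theta, eta))
```

The bound is typically loose, since it ignores any overlap between the component densities; it is tightest when the components have disjoint supports.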


B. Exponential families 

Let $\mathcal{X}$ be a measurable subset of $\mathbb{R}^d$, and let $\Theta$ be a compact subset of $\mathbb{R}^k$. A family $\{p_\theta : \theta \in \Theta\}$ of probability densities on $\mathcal{X}$ is an exponential family [28], [35] if each $p_\theta$ has the form

$$p_\theta(x) = p(x) \exp\left( \sum_{i=1}^{k} \theta_i h_i(x) - g(\theta) \right), \tag{5.28}$$

where $p$ is a fixed reference density, $h_1, \dots, h_k$ are fixed real-valued functions on $\mathcal{X}$, and

$$g(\theta) = \ln \int_{\mathcal{X}} e^{\theta \cdot h(x)} p(x)\, dx$$

is the normalization constant. By way of notation, $h(x) = (h_1(x), \dots, h_k(x))$ and $\theta \cdot h(x) = \sum_{i=1}^{k} \theta_i h_i(x)$. Given the densities $p$ and $p_\theta$, let $P$ and $P_\theta$ denote the corresponding distributions. The assumed compactness of $\Theta$ guarantees that the family $\{P_\theta : \theta \in \Theta\}$ satisfies Condition 1) of Theorem 3.2. In the following, we shall demonstrate that Conditions 2) and 3) can also be met under certain regularity assumptions.

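It is customary to choose the functions $h_i$ in such a way that $\{1, h_1, \dots, h_k\}$ is a linearly independent set. This guarantees that the map $\theta \mapsto P_\theta$ is one-to-one. We shall also assume that each $h_i$ is square-integrable with respect to $P$:

$$\int_{\mathcal{X}} h_i^2\, dP = \int_{\mathcal{X}} h_i^2(x) p(x)\, dx < \infty, \quad 1 \le i \le k.$$

Then the $(k+1)$-dimensional real linear space $\mathcal{F} \subset L^2(\mathcal{X}, P)$ spanned by $\{1, h_1, \dots, h_k\}$ can be equipped with an inner product

$$\langle f, g \rangle = \int_{\mathcal{X}} f g\, dP, \quad f, g \in \mathcal{F}$$

and the corresponding $L_2$ norm

$$\|f\|_2 = \sqrt{\langle f, f \rangle} = \sqrt{\int_{\mathcal{X}} f^2\, dP}, \quad f \in \mathcal{F}.$$

Also let

$$\|f\|_\infty = \inf \{ M : |f(x)| \le M\ \ P\text{-a.e.} \}$$

denote the $L_\infty$ norm of $f$. Since $\mathcal{F}$ is finite-dimensional, there exists a constant $A_k > 0$ such that

$$\|f\|_\infty \le A_k \|f\|_2.$$

Finally, assume that the logarithms of Radon-Nikodym derivatives $dP/dP_\theta = p/p_\theta$ are uniformly bounded $P$-a.e.: $\sup_{\theta \in \Theta} \|\ln p/p_\theta\|_\infty < \infty$. Let

$$D(P_\theta \| P_\eta) = \int_{\mathcal{X}} \frac{dP_\theta}{dP_\eta} \ln \frac{dP_\theta}{dP_\eta}\, dP_\eta$$

denote the relative entropy (information divergence) between $P_\theta$ and $P_\eta$. Then we have the following basic estimate:

Lemma 5.1. For all $\theta, \eta \in \Theta$,

$$D(P_\theta \| P_\eta) \le \frac{1}{2} e^{\|\ln p/p_\theta\|_\infty} e^{2A_k \|\theta - \eta\|} \|\theta - \eta\|^2,$$

where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^k$.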

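Proof: The proof is along the lines of Barron and Sheu [35, Lemma 5]. Without loss of generality, we may assume that the functions $\{h_0, h_1, \dots, h_k\}$, $h_0 \equiv 1$, form an orthonormal set with respect to $P$:

$$\langle h_i, h_j \rangle = \int_{\mathcal{X}} h_i h_j\, dP = \delta_{ij}, \quad 0 \le i, j \le k.$$

Then

$$\|(\theta - \eta) \cdot h\|_2 = \left\| \sum_{i=1}^{k} (\theta_i - \eta_i) h_i \right\|_2 = \|\theta - \eta\|.$$

Now, since

$$g(\eta) - g(\theta) = \ln \int_{\mathcal{X}} e^{(\eta - \theta) \cdot h}\, dP_\theta = \ln \int_{\mathcal{X}} e^{(\eta - \theta) \cdot h(x)} p_\theta(x)\, dx,$$

we have

$$|g(\eta) - g(\theta)| \le \|(\eta - \theta) \cdot h\|_\infty \le A_k \|(\eta - \theta) \cdot h\|_2 = A_k \|\eta - \theta\|.$$

Furthermore,

$$\ln \frac{p_\theta}{p_\eta} = (\theta - \eta) \cdot h + g(\eta) - g(\theta),$$

whence it follows that the logarithm of the Radon-Nikodym derivative $dP_\theta/dP_\eta = p_\theta/p_\eta$ is bounded $P$-a.e.: $\|\ln p_\theta/p_\eta\|_\infty \le 2\|(\eta - \theta) \cdot h\|_\infty \le 2A_k \|\eta - \theta\|$. In this case, the relative entropy $D(P_\theta \| P_\eta)$ satisfies [35, Lemma 1]

$$D(P_\theta \| P_\eta) \le \frac{1}{2} e^{\|\ln p_\theta/p_\eta - c\|_\infty} \int_{\mathcal{X}} \left( \ln \frac{p_\theta}{p_\eta} - c \right)^2 dP_\theta$$

for any constant $c$. Choosing $c = g(\eta) - g(\theta)$ and using the orthonormality of the $h_i$, we get

$$
\begin{aligned}
D(P_\theta \| P_\eta) &\le \frac{1}{2} e^{\|(\theta - \eta) \cdot h\|_\infty} \int_{\mathcal{X}} \big( (\theta - \eta) \cdot h \big)^2 dP_\theta \\
&= \frac{1}{2} e^{\|(\theta - \eta) \cdot h\|_\infty} \int_{\mathcal{X}} \frac{p_\theta}{p} \big( (\theta - \eta) \cdot h \big)^2 dP \\
&\le \frac{1}{2} e^{\|\ln p/p_\theta\|_\infty} e^{2A_k \|\theta - \eta\|} \|(\theta - \eta) \cdot h\|_2^2 \\
&= \frac{1}{2} e^{\|\ln p/p_\theta\|_\infty} e^{2A_k \|\theta - \eta\|} \|\theta - \eta\|^2,
\end{aligned}
$$

and the lemma is proved. ■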

Now, using Pinsker's inequality $d_V(P_\theta, P_\eta) \le \sqrt{(1/2) D(P_\theta \| P_\eta)}$ [38, Lemma 5.2.8] together with the above lemma and the assumed uniform boundedness of $\ln p/p_\theta$, we get the bound

$$d_V(P_\theta, P_\eta) \le m_0 e^{A_k \|\theta - \eta\|} \|\theta - \eta\|, \quad \theta, \eta \in \Theta, \tag{5.29}$$

where $m_0 = \frac{1}{2} \exp\left( \frac{1}{2} \sup_{\theta \in \Theta} \|\ln p/p_\theta\|_\infty \right)$. If we fix $\theta \in \Theta$, then from (5.29) it follows that, for any $r > 0$,

$$d_V(P_\theta, P_\eta) \le m_0 e^{A_k r} \|\theta - \eta\|$$

for all $\eta$ satisfying $\|\eta - \theta\| < r$. That is, the family $\{P_\theta : \theta \in \Theta\}$ satisfies the uniform local Lipschitz condition [Condition 2) of Theorem 3.2], and the magnitude of the Lipschitz constant can be controlled by tuning $r$.

All we have left to show is that the Vapnik-Chervonenkis condition [Condition 3) of Theorem 3.2] is satisfied. Let $\theta, \eta \in \Theta$ be distinct; then $p_\theta(x) > p_\eta(x)$ if and only if $(\theta - \eta) \cdot h(x) > g(\theta) - g(\eta)$. Thus, the corresponding Yatracos class consists of sets of the form

$$\left\{ x \in \mathcal{X} : \alpha_0 + \sum_{i=1}^{k} \alpha_i h_i(x) > 0 \right\}, \quad \alpha = (\alpha_0, \alpha_1, \dots, \alpha_k) \in \mathbb{R}^{k+1}.$$

Since the functions $1, h_1, \dots, h_k$ span a $(k+1)$-dimensional linear space, $V(\mathcal{A}_\Theta) \le k + 1$ by Lemma A.3.
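The half-space description of the Yatracos sets can be checked on a concrete one-parameter family. The Python sketch below uses a hypothetical example that is not discussed in the paper: $\mathcal{X} = [0, 1]$, uniform reference density $p \equiv 1$, and $h(x) = x$, for which $g(\theta) = \ln((e^\theta - 1)/\theta)$ when $\theta \neq 0$ and $g(0) = 0$. It compares direct evaluation of $p_\theta(x) > p_\eta(x)$ with the linear test $(\theta - \eta) h(x) > g(\theta) - g(\eta)$. Function names are illustrative.

```python
import math


def log_partition(theta):
    """g(theta) = ln integral_0^1 exp(theta * x) dx for the uniform reference density."""
    return 0.0 if theta == 0.0 else math.log(math.expm1(theta) / theta)


def density(x, theta):
    """p_theta(x) = exp(theta * x - g(theta)) on [0, 1]."""
    return math.exp(theta * x - log_partition(theta))


def in_yatracos_set(x, theta, eta):
    """Linear test equivalent to p_theta(x) > p_eta(x)."""
    return (theta - eta) * x > log_partition(theta) - log_partition(eta)


if __name__ == "__main__":
    theta, eta = 2.0, -1.0
    for i in range(0, 21, 5):
        x = i / 20.0
        print(x, density(x, theta) > density(x, eta), in_yatracos_set(x, theta, eta))
```

Because membership reduces to the sign of an affine function of $h(x)$, the Yatracos sets of this family are intervals of the form $\{x > t\}$ or $\{x < t\}$, a class of very low VC dimension.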



VI. Summary and discussion 

We have constructed and analyzed a scheme for universal fixed-rate lossy coding of continuous-alphabet i.i.d. 
sources based on a forward relation between statistical modeling and universal coding, in the spirit of Rissanen’s 
achievability theorem [1, Theorem lb] (see also Theorem 2 in [13]). To the best of our knowledge, such a joint 
universal source coding and source modeling scheme has not been constructed before, although Chou et al. [13] 
have demonstrated the existence of universal vector quantizers whose Lagrangian redundancies converge to zero 
at the same rate as the corresponding redundancies in Rissanen’s achievability theorem for the lossless case. What 
we have shown is that, for a wide class of bounded distortion measures and for any compactly parametrized 
family of i.i.d. sources with absolutely continuous distributions satisfying a smoothness condition and a Vapnik- 
Chervonenkis learnability condition, the tasks of parameter estimation (statistical modeling) and universal lossy 
coding can be accomplished jointly in a two-stage set-up, with the overhead per-letter rate and the distortion 
redundancy converging to zero as $O(n^{-1} \log n)$ and $O(\sqrt{n^{-1} \log n})$, respectively, as the block length n tends to 
infinity, and the extra bits generated by the first-stage encoder can be used to identify the active source up to a 
variational ball of radius $O(\sqrt{n^{-1} \log n})$ (a.s.). We have compared our scheme with several existing schemes for 
universal vector quantization and demonstrated that our approach offers essentially similar performance in terms of 
rate and distortion, while also allowing the decoder to reconstruct the statistics of the source with good precision. 
We have described an extension of our scheme to unbounded distortion measures satisfying a moment condition 
with respect to a reference letter, which suffers no change in overhead rate or in source estimation fidelity, although 
it gives a slower, $O(\sqrt[4]{n^{-1} \log n})$, convergence rate for distortion redundancies. Finally, we have presented detailed 
examples of parametric sources satisfying the conditions of our Theorem 3.2 (namely, finite mixture classes and 
exponential families) and thus admitting schemes for joint universal quantization and modeling. 

As mentioned in the Introduction, in treating universal lossy source coding as a statistical problem the term “model” 
can refer either to a probabilistic description of the source or to a probabilistic description of a rate-distortion 
codebook. In fact, as shown by Kontoyiannis and Zhang [39], for variable-rate lossy codes operating under a 
fixed distortion constraint, there is a one-to-one correspondence between codes and discrete distributions over 
sequences in the reproduction space (satisfying suitable “admissibility” conditions), which they dubbed the “lossy 
Kraft inequality.” The same paper also demonstrated the existence of variable-rate universal lossy codes for 
finite-alphabet memoryless sources with rate redundancy converging to zero as $(k/2) \log n / n$, where k is the dimension 
of the simplex of probability distributions on the reproduction alphabet. Yang and Zhang [40] proved an analogous 
result for fixed-rate universal lossy codes and showed furthermore that the $(k/2) \log n / n$ convergence rate is optimal 
in a certain sense. (The redundancies in our scheme are therefore suboptimal, as can be seen from comparing them 
to these bounds, as well as to those of Chou et al. [13]. It is certainly an interesting open problem to determine 
lower bounds on the redundancies in the setting of joint source coding and identification.) These papers, together 
with the work of Madiman, Harrison and Kontoyiannis [41], [42], can be thought of as generalizing Rissanen’s 
MDL principle to the lossy setting, provided that the term “model” is understood to refer to probability distributions 
over codebooks in the reproduction space. 

We close by outlining several potential directions for further research. First of all, it would be of both theoretical 
and practical interest to extend the results presented here to sources with memory in order to allow more realistic 
source models such as autoregressive or Markov sources, and to variable-rate codes, so that unbounded parameter 
spaces could be accommodated. We have made some initial progress in this direction in [43], [44], where we 
constructed joint schemes for variable-rate universal lossy coding and identification of stationary ergodic sources 
satisfying a certain mixing condition. Moreover, the theory presented here needs to be tested in practical settings, 
one promising area for applications being media forensics [45], where the parameter $\theta$ could represent traces or 
“evidence” of some prior processing performed, say, on an image or on a video sequence, and where the goal is 
to design an efficient system for compressing the data for the purposes of transmission or storage in such a way 
that the evidence can be later recovered from the compressed signal with minimal degradation in fidelity. 

Appendix A 

Vapnik-Chervonenkis theory 

In this appendix, we summarize, for the reader’s convenience, some basic concepts and results of the Vapnik- 
Chervonenkis theory. A detailed treatment can be found, e.g., in [19]. 

Definition A.1 (shatter coefficient). Let $\mathcal{A}$ be an arbitrary collection of measurable subsets of $\mathbb{R}^d$. Given an n-tuple 
$x^n = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$, let $\mathcal{A}(x^n)$ be the subset of $\{0,1\}^n$ obtained by listing all distinct binary strings of the 
form $(1_{\{x_1 \in A\}}, \ldots, 1_{\{x_n \in A\}})$ as $A$ is varied over $\mathcal{A}$. Then 

$$S_{\mathcal{A}}(n) = \max_{x^n \in (\mathbb{R}^d)^n} |\mathcal{A}(x^n)|$$

is called the nth shatter coefficient of $\mathcal{A}$. 


Definition A.2 (VC dimension; VC class). The largest integer n for which $S_{\mathcal{A}}(n) = 2^n$ is called the Vapnik- 
Chervonenkis dimension (or the VC dimension, for short) of $\mathcal{A}$ and denoted by $V(\mathcal{A})$. If $S_{\mathcal{A}}(n) = 2^n$ for all 
$n = 1, 2, \ldots$, then we define $V(\mathcal{A}) = \infty$. If $V(\mathcal{A}) < \infty$, we say that $\mathcal{A}$ is a Vapnik-Chervonenkis class (or VC 
class). 
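
To make Definitions A.1 and A.2 concrete, here is a minimal Python sketch that computes shatter coefficients by brute force for one simple class: the closed intervals $[a, b]$ of the real line, together with the empty set. The choice of class and all function names are assumptions made for illustration; the paper does not discuss this example.

```python
def interval_patterns(points):
    """Distinct binary strings (1{x_1 in A}, ..., 1{x_n in A}) as A ranges over
    closed intervals [a, b] of the real line (the empty set included)."""
    pts = list(points)
    patterns = {tuple(0 for _ in pts)}  # the empty interval picks out no point
    # It suffices to let the endpoints range over the sample points themselves,
    # since shrinking an interval to its outermost covered points keeps the pattern.
    for a in pts:
        for b in pts:
            if a <= b:
                patterns.add(tuple(1 if a <= x <= b else 0 for x in pts))
    return patterns

def shatter_coefficient(n):
    """n-th shatter coefficient of the interval class.

    Repeated points can only reduce the number of patterns, and for distinct
    points only their ordering matters, so the configuration 0, 1, ..., n-1
    attains the maximum in Definition A.1."""
    return len(interval_patterns(range(n)))

def vc_dimension(max_n):
    """Largest n <= max_n with S(n) = 2^n (searching only up to max_n)."""
    return max(n for n in range(1, max_n + 1) if shatter_coefficient(n) == 2 ** n)
```

For this class the count is $n(n+1)/2 + 1$, which equals $2^n$ for $n = 1, 2$ and falls short from $n = 3$ on (no interval picks out the two outer points of three without the middle one), so the class has VC dimension 2.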

The basic result of Vapnik-Chervonenkis theory relates the shatter coefficient $S_{\mathcal{A}}(n)$ to uniform deviations of the 
probabilities of events in $\mathcal{A}$ from their relative frequencies with respect to an i.i.d. sample of size n: 


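Lemma A.1 (the Vapnik-Chervonenkis inequalities). Let $\mathcal{A}$ be an arbitrary collection of measurable subsets of 
$\mathbb{R}^d$, and let $X^n = (X_1, \ldots, X_n)$ be an n-tuple of i.i.d. random variables in $\mathbb{R}^d$ with the common distribution P. Then 

$$\mathbb{P}\left\{ \sup_{A \in \mathcal{A}} |P_{X^n}(A) - P(A)| > \epsilon \right\} \le 8 S_{\mathcal{A}}(n) e^{-n\epsilon^2/32} \tag{A.1}$$

for any $\epsilon > 0$, and 

$$\mathbb{E}\left\{ \sup_{A \in \mathcal{A}} |P_{X^n}(A) - P(A)| \right\} \le 2\sqrt{\frac{\log 2 S_{\mathcal{A}}(n)}{n}}, \tag{A.2}$$

where $P_{X^n}$ is the empirical distribution of $X^n$: 

$$P_{X^n}(B) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \in B\}}$$

for all Borel sets $B \subset \mathbb{R}^d$. The probabilities and expectations are with respect to P. 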

Now, if $\mathcal{A}$ is a VC class and $V(\mathcal{A}) > 2$, then the results of Vapnik and Chervonenkis [46] and Sauer [47] imply 
that $S_{\mathcal{A}}(n) \le n^{V(\mathcal{A})}$. Plugging this bound into (A.1) and (A.2), we obtain the following: 


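Lemma A.2. If $\mathcal{A}$ is a VC class with $V(\mathcal{A}) > 2$, then 

$$\mathbb{P}\left\{ \sup_{A \in \mathcal{A}} |P_{X^n}(A) - P(A)| > \epsilon \right\} \le 8 n^{V(\mathcal{A})} e^{-n\epsilon^2/32} \tag{A.3}$$

for any $\epsilon > 0$, and 

$$\mathbb{E}\left\{ \sup_{A \in \mathcal{A}} |P_{X^n}(A) - P(A)| \right\} \le c \sqrt{\frac{\log n}{n}}, \tag{A.4}$$

where c is a constant that depends only on $V(\mathcal{A})$. 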
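
To get a feel for the sample sizes at which the tail bound of the form $8 n^{V} e^{-n\epsilon^2/32}$ becomes informative, the following Python sketch evaluates its logarithm and searches for the smallest n, within the regime where the bound is decreasing in n, at which it falls below a target level. The function names and the search strategy are illustrative assumptions, not part of the paper.

```python
import math

def log_vc_tail_bound(n, eps, vc_dim):
    """Natural log of 8 * n^V * exp(-n * eps^2 / 32)."""
    return math.log(8.0) + vc_dim * math.log(n) - n * eps * eps / 32.0

def sample_size(eps, vc_dim, delta):
    """Smallest n >= 32 V / eps^2 for which the tail bound is at most delta.

    The log-bound has derivative V/n - eps^2/32 in n, so it is nonincreasing
    for n >= 32 V / eps^2; the search is restricted to that regime."""
    target = math.log(delta)
    lo = math.ceil(32.0 * vc_dim / (eps * eps))
    if log_vc_tail_bound(lo, eps, vc_dim) <= target:
        return lo
    hi = 2 * lo
    while log_vc_tail_bound(hi, eps, vc_dim) > target:
        hi *= 2
    # Invariant: bound(lo) > delta >= bound(hi); bisect on the integers.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_vc_tail_bound(mid, eps, vc_dim) > target:
            lo = mid
        else:
            hi = mid
    return hi
```

Working with the logarithm avoids overflow in $n^V$ for large n. The sample sizes returned are typically large, which reflects the conservative constants in the inequality rather than anything specific to the scheme studied here.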


Remark A.1. One can use more delicate arguments involving metric entropies and covering numbers, along the 
lines of Dudley [48], to improve the bound in (A.4) to $c'\sqrt{1/n}$, where $c' = c'(V(\mathcal{A}))$ is another constant. However, 
$c'$ turns out to be much larger than c, so that, for all “practical” values of n, the “improved” $O(\sqrt{1/n})$ bound is 
much worse than the original $O(\sqrt{n^{-1} \log n})$ bound. 

Lemma A.3. Let $\mathcal{F}$ be an m-dimensional linear space of real-valued functions on $\mathbb{R}^d$. Then the class 

$$\mathcal{A} = \left\{ \{x : f(x) > 0\} : f \in \mathcal{F} \right\}$$

is a VC class, and $V(\mathcal{A}) \le m$. 

Appendix B 

Proof of Proposition 3.1 


Fix $\theta \in \Theta$, and let X be distributed according to $P_\theta$. Let the distortion function $\rho$ satisfy Condition 1) of 
Proposition 3.1. Then a result of Csiszár [23] says that, for each point $(R, D_\theta(R))$ on the distortion-rate curve 
for $P_\theta$, there exists a random variable Y with values in the reproduction alphabet $\hat{\mathcal{X}}$, where the joint distribution 
of X and Y is such that 

$$I(X;Y) = R \quad \text{and} \quad \mathbb{E}_{XY}[\rho(X,Y)] = D_\theta(R), \tag{B.1}$$

and the Radon-Nikodym derivative 

$$a(x,y) \triangleq \frac{dP_{XY}}{d(P_X \times P_Y)}(x,y),$$

where $P_X = P_\theta$, has the parametric form 

$$a(x,y) = \alpha(x) 2^{-s\rho(x,y)}, \tag{B.2}$$

where $s > 0$ and $\alpha(x) \ge 1$ satisfy 

$$\int_{\mathcal{X}} \alpha(x) 2^{-s\rho(x,y)} \, dP_\theta(x) \le 1, \qquad \forall y \in \hat{\mathcal{X}}, \tag{B.3}$$

and $-1/s = D'_\theta(R)$, the derivative of the DRF $D_\theta(R)$ at R, i.e., $-1/s$ is the slope of the tangent to the graph of 
$D_\theta(R)$ at R. 

Next, let $N = \lfloor 2^{n(R+\delta)} \rfloor$, where $\delta > 0$ will be specified later, and generate a random codebook W as a vector 
$W = (W_1, \ldots, W_N)$, where each $W_i = (W_{i1}, \ldots, W_{in}) \in \hat{\mathcal{X}}^n$ and the $W_{ij}$'s are i.i.d. according to $P_Y$. Thus, 

$$P_W = \mathop{\times}_{i=1}^{N} P_Y^n$$

is the probability distribution for the randomly selected codebook. We also assume that W is independent from 
$X^n = (X_1, \ldots, X_n)$. Now, let $C_W$ be a (random) n-block code with the reproduction codebook W and the 
minimum-distortion encoder, so that $\rho(x^n, C_W(x^n)) = \rho(x^n, W)$, where $\rho(x^n, W) = \min_{1 \le i \le N} \rho(x^n, W_i)$. Then the 
average per-letter distortion of this random code over the codebook generation and the source sequence is 

$$\Delta_n = \int D_\theta(C_w) \, dP_W(w) = \frac{1}{n} \int \mathbb{E}_\theta\left[\rho(X^n, w)\right] dP_W(w) = \frac{1}{n} \int \left[ \int \rho(x^n, w) \, dP_\theta(x^n) \right] dP_W(w).$$


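Using standard arguments (see, e.g., Gallager’s proof of the source coding theorem [22, Ch. 9]), we can bound $\Delta_n$ 
from above as 

$$\Delta_n \le D_\theta(R) + \delta + \rho_{\max} \left( P_{XY}\left(Y^n \notin \mathcal{S}_{X^n}\right) + e^{-N 2^{-n(R+\delta/2)}} \right), \tag{B.4}$$

where 

$$\mathcal{S}_{x^n} = \left\{ y^n \in \hat{\mathcal{X}}^n : \rho(x^n, y^n) \le n(D_\theta(R) + \delta) \text{ and } i_n(x^n, y^n) \le n(R + \delta/2) \right\}$$

and 

$$i_n(x^n, y^n) = \sum_{i=1}^{n} \log a(x_i, y_i) = \log a(x^n, y^n)$$

is the sample mutual information. Here, the pairs $(X_i, Y_i)$ are i.i.d. according to $P_{XY}$. Now, by the union bound, 

$$P_{XY}\left(Y^n \notin \mathcal{S}_{X^n}\right) \le P_{XY}\left( \frac{1}{n} \sum_{i=1}^{n} \log a(X_i, Y_i) > R + \delta/2 \right) + P_{XY}\left( \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, Y_i) > D_\theta(R) + \delta \right). \tag{B.5}$$
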
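Note that from (B.1) we have that $\mathbb{E}_{XY}[\log a(X,Y)] = I(X;Y) = R$ and $\mathbb{E}_{XY}[\rho(X,Y)] = D_\theta(R)$. Since $0 \le 
\rho(X,Y) \le \rho_{\max}$, the second probability on the right-hand side of (B.5) can be bounded using Hoeffding’s inequality 
[49], which states that for i.i.d. random variables $S_1, \ldots, S_n$ satisfying $a \le S_i \le b$ a.s., 

$$\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^{n} S_i \ge \mathbb{E}[S_1] + \delta \right) \le e^{-2n\delta^2/(b-a)^2}.$$

This yields the estimate 

$$P_{XY}\left( \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, Y_i) > D_\theta(R) + \delta \right) \le e^{-2n\delta^2/\rho_{\max}^2}. \tag{B.6}$$

In order to apply Hoeffding’s inequality to the first probability on the right-hand side of (B.5), we have to show that 
$\log a(X,Y)$ is bounded. From (B.2) we have that $\log a(x,y) = \log \alpha(x) - s\rho(x,y)$. On the other hand, integrating 
both sides of (B.2) with respect to $P_Y$, we get 

$$\frac{1}{\alpha(x)} = \int_{\hat{\mathcal{X}}} 2^{-s\rho(x,y)} \, dP_Y(y).$$

Since $2^{-s\rho_{\max}} \le 2^{-s\rho(x,y)} \le 1$, we have that $1 \le \alpha(x) \le 2^{s\rho_{\max}}$, whence it follows that $-s\rho_{\max} \le \log a(x,y) \le 
s\rho_{\max}$. Thus, by Hoeffding’s inequality, 

$$P_{XY}\left( \frac{1}{n} \sum_{i=1}^{n} \log a(X_i, Y_i) > R + \delta/2 \right) \le e^{-n\delta^2/(8 s^2 \rho_{\max}^2)}. \tag{B.7}$$
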

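Putting together (B.4), (B.6) and (B.7), and using the fact that $N > 2^{n(R+\delta)} - 1$, we obtain 

$$\Delta_n \le D_\theta(R) + \delta + \rho_{\max} \left( e^{-2n\delta^2/\rho_{\max}^2} + e^{-n\delta^2/(8 s^2 \rho_{\max}^2)} + e^{-(2^{n\delta/2} - 1)} \right).$$

Since $\Delta_n$ is the average of the expected distortion over the random choice of codes, it follows that there exists at 
least one code whose average distortion with respect to $P_\theta$ is no larger than $\Delta_n$. Thus, 

$$\hat{D}^n_\theta(R + \delta) \le D_\theta(R) + \delta + \rho_{\max} \left( e^{-2n\delta^2/\rho_{\max}^2} + e^{-n\delta^2/(8 s^2 \rho_{\max}^2)} + e^{-(2^{n\delta/2} - 1)} \right).$$

Now, let $c_s = \max(\rho_{\max}/2, 2s\rho_{\max})$ and put $\delta = c_s \sqrt{n^{-1} \ln n}$ to get 

$$\hat{D}^n_\theta\left(R + c_s \sqrt{n^{-1} \ln n}\right) - D_\theta(R) \le (c_s + o(1)) \sqrt{\frac{\ln n}{n}}. \tag{B.8}$$

Because $-1/s$ is the slope of the tangent to the distortion-rate curve at the point $(R, D_\theta(R))$, and because $D_\theta(R)$ 
is nonincreasing in R, we have $-1/s' \le -1/s$ for $s'$ corresponding to another point $(R', D_\theta(R'))$ with $R' < R$. 
Thus, $c_{s'} \le c_s$, and (B.8) remains valid for all $R' < R$. Thus, let $R' = R - c_s \sqrt{n^{-1} \ln n}$ to get 

$$\hat{D}^n_\theta(R) - D_\theta\left(R - c_s \sqrt{n^{-1} \ln n}\right) \le (c_s + o(1)) \sqrt{\frac{\ln n}{n}}.$$

Therefore, expanding $D_\theta(R)$ in a Taylor series to first order and recalling that $-1/s = D'_\theta(R)$, we see that 

$$\hat{D}^n_\theta(R) - D_\theta(R) \le c_s \left( 1 + \frac{1}{s} + o(1) \right) \sqrt{\frac{\ln n}{n}},$$

and the proposition is proved. 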
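
The role of the constant $c_s = \max(\rho_{\max}/2, 2s\rho_{\max})$ can be checked numerically. The Python sketch below assumes the two Hoeffding exponents take the standard forms $2n\delta^2/\rho_{\max}^2$ (for a variable with range $\rho_{\max}$ and deviation $\delta$) and $n\delta^2/(8s^2\rho_{\max}^2)$ (range $2s\rho_{\max}$, deviation $\delta/2$); under that assumption, the choice $\delta = c_s\sqrt{n^{-1}\ln n}$ makes both exponential terms at most $n^{-1/2}$, which is negligible next to $\delta$ itself. The function name is hypothetical.

```python
import math

def hoeffding_terms(n, s, rho_max):
    """Return (delta, distortion_term, information_term) for the choice
    delta = c_s * sqrt(ln(n) / n) with c_s = max(rho_max / 2, 2 * s * rho_max)."""
    c_s = max(rho_max / 2.0, 2.0 * s * rho_max)
    delta = c_s * math.sqrt(math.log(n) / n)
    # Hoeffding term for the distortion: range rho_max, deviation delta.
    distortion_term = math.exp(-2.0 * n * delta ** 2 / rho_max ** 2)
    # Hoeffding term for the information density: range 2*s*rho_max, deviation delta/2.
    information_term = math.exp(-n * delta ** 2 / (8.0 * s ** 2 * rho_max ** 2))
    return delta, distortion_term, information_term
```

Algebraically, the two terms equal $n^{-2c_s^2/\rho_{\max}^2}$ and $n^{-c_s^2/(8s^2\rho_{\max}^2)}$, and the two arguments of the maximum defining $c_s$ are exactly the thresholds at which each exponent reaches $1/2$.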


Appendix C 

Quantizer mismatch lemma 

Lemma C.1. Let P and Q be two absolutely continuous probability distributions on $\mathcal{X} \subset \mathbb{R}^d$, with respective 
densities p and q, and let $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ be a single-letter distortion measure having the form $\rho(x, \hat{x}) = [d(x, \hat{x})]^p$, 
where $p > 0$ and $d(\cdot, \cdot)$ is a bounded metric on $\mathcal{X} \cup \hat{\mathcal{X}}$. Consider an n-block lossy code $C^n$ with the nearest-neighbor 
encoder, and let 

$$D_P(C^n) = \frac{1}{n} \mathbb{E}_P\left[\rho(X^n, C^n(X^n))\right] = \frac{1}{n} \int_{\mathcal{X}^n} \rho(x^n, C^n(x^n)) \, dP(x^n)$$

be the average per-letter distortion of $C^n$ with respect to P. Define $D_Q(C^n)$ similarly. Then 

$$\left| D_P(C^n)^{1/p} - D_Q(C^n)^{1/p} \right| \le 2^{1/p} d_{\max} d_V(P,Q). \tag{C.9}$$

Furthermore, the corresponding nth-order operational DRF’s $\hat{D}^n_P(R)$ and $\hat{D}^n_Q(R)$ satisfy 

$$\left| \hat{D}^n_P(R)^{1/p} - \hat{D}^n_Q(R)^{1/p} \right| \le 2^{1/p} d_{\max} d_V(P,Q). \tag{C.10}$$

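Proof: The proof closely follows Gray, Neuhoff and Shields [26]. Let $\mathcal{P}_n(P,Q)$ denote the set of all probability 
measures on $\mathcal{X}^n \times \mathcal{X}^n$ having $P^n$ and $Q^n$ as marginals, and let $\bar{\mu}$ achieve (or come arbitrarily close to) the infimum 
in the Wasserstein metric 

$$\bar{\rho}_n(P,Q) = \inf_{\mu \in \mathcal{P}_n(P,Q)} \left( \frac{1}{n} \int_{\mathcal{X}^n \times \mathcal{X}^n} \rho(x^n, y^n) \, d\mu(x^n, y^n) \right)^{1/p}.$$

Suppose that $D_P(C^n) \le D_Q(C^n)$. Then, using the fact that d is a metric, Minkowski’s inequality, and the 
nearest-neighbor property of $C^n$, we have 

$$D_P(C^n)^{1/p} = \left( \frac{1}{n} \int_{\mathcal{X}^n} \rho(x^n, C^n(x^n)) \, dP^n(x^n) \right)^{1/p}$$

$$= \left( \frac{1}{n} \int_{\mathcal{X}^n \times \mathcal{X}^n} \rho(x^n, C^n(x^n)) \, d\bar{\mu}(x^n, y^n) \right)^{1/p}$$

$$\le \left( \frac{1}{n} \int_{\mathcal{X}^n \times \mathcal{X}^n} \rho(x^n, y^n) \, d\bar{\mu}(x^n, y^n) \right)^{1/p} + \left( \frac{1}{n} \int_{\mathcal{X}^n \times \mathcal{X}^n} \rho(y^n, C^n(y^n)) \, d\bar{\mu}(x^n, y^n) \right)^{1/p}$$

$$= \bar{\rho}_n(P,Q) + D_Q(C^n)^{1/p}.$$

Now, 

$$\bar{\rho}_n(P,Q) = \bar{\rho}_1(P,Q) = \inf_{\mu \in \mathcal{P}_1(P,Q)} \left( \int_{\mathcal{X} \times \mathcal{X}} \rho(x,y) \, d\mu(x,y) \right)^{1/p}$$

(see, e.g., [26], Section 2), and $\rho(x,y) \le d_{\max}^p 1_{\{x \ne y\}}$, so 

$$\inf_{\mu \in \mathcal{P}_1(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} \rho(x,y) \, d\mu(x,y) \le d_{\max}^p \inf_{\mu \in \mathcal{P}_1(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} 1_{\{x \ne y\}} \, d\mu(x,y).$$

The right-hand side of this expression is the well-known coupling characterization of twice the variational distance 
$d_V(P,Q)$ (see, e.g., Section I.5 of Lindvall [50]), so we obtain 

$$D_P(C^n)^{1/p} \le D_Q(C^n)^{1/p} + 2^{1/p} d_{\max} d_V(P,Q).$$

Interchanging the roles of P and Q, we obtain (C.9). 

To prove (C.10), let $C^n_*$ achieve the nth-order optimum for P: $D_P(C^n_*) = \hat{D}^n_P(R)$. Without loss of generality, we 
can assume that $C^n_*$ has a nearest-neighbor encoder. Then 

$$\hat{D}^n_Q(R)^{1/p} \le D_Q(C^n_*)^{1/p} \le D_P(C^n_*)^{1/p} + 2^{1/p} d_{\max} d_V(P,Q) = \hat{D}^n_P(R)^{1/p} + 2^{1/p} d_{\max} d_V(P,Q).$$

The other direction is proved similarly. ■ 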
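
The coupling step in the proof can be illustrated on a finite alphabet. The Python sketch below builds a maximal coupling of two probability mass functions and confirms that the mismatch probability $\mu(\{x \ne y\})$ it attains equals $\frac{1}{2}\sum_i |p_i - q_i| = 1 - \sum_i \min(p_i, q_i)$, which is the smallest value any coupling can achieve. The finite alphabet and the function names are assumptions made for this example; how the quantity relates to $d_V$ depends on the normalization of the variational distance fixed earlier in the paper, which lies outside this appendix.

```python
def half_l1_distance(p, q):
    """(1/2) * sum_i |p_i - q_i| for two pmfs on the same finite alphabet."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def maximal_coupling(p, q):
    """Joint pmf (as a matrix) with marginals p and q that puts as much
    mass as possible on the diagonal {x = y}."""
    k = len(p)
    overlap = [min(a, b) for a, b in zip(p, q)]
    off_mass = 1.0 - sum(overlap)          # mass that must sit off the diagonal
    mu = [[0.0] * k for _ in range(k)]
    for i in range(k):
        mu[i][i] = overlap[i]
    if off_mass > 0:
        excess_p = [a - m for a, m in zip(p, overlap)]
        excess_q = [b - m for b, m in zip(q, overlap)]
        # For each i, at least one of excess_p[i], excess_q[i] is zero,
        # so this product term never adds mass to the diagonal.
        for i in range(k):
            for j in range(k):
                mu[i][j] += excess_p[i] * excess_q[j] / off_mass
    return mu

def mismatch_probability(mu):
    """mu({x != y}) for a joint pmf given as a square matrix."""
    return sum(mu[i][j] for i in range(len(mu)) for j in range(len(mu)) if i != j)
```

The construction keeps the common part $\min(p_i, q_i)$ on the diagonal and spreads the leftover mass as a product of the normalized excesses, which preserves both marginals.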

Appendix D 
Proof of Lemma 4.1 

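Fix a $\theta \in \Theta$. Define the measurable set $U = \{(x^n, z^n) \in \mathcal{X}^n \times \mathcal{X}^n : \tilde{C}^{n,n}(x^n, z^n) = a_*^n\}$. Then the distortion 
$D_\theta(\tilde{C}^{n,n})$ can be split into two terms as 

$$D_\theta(\tilde{C}^{n,n}) = \frac{1}{n} \int_{\mathcal{X}^n \times \mathcal{X}^n} \rho(x^n, \tilde{C}^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n)$$

$$= \frac{1}{n} \int_{U} \rho(x^n, \tilde{C}^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n) + \frac{1}{n} \int_{U^c} \rho(x^n, \tilde{C}^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n), \tag{D.1}$$

where the superscript c denotes set-theoretic complement. We shall prove the lemma by upper-bounding separately 
each of the two terms on the right-hand side of (D.1). 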

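First of all, we have 

$$\frac{1}{n} \int_{\mathcal{X}^n \times \mathcal{X}^n} \rho_M(x^n, C^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n) = D_\theta(C^{n,n}). \tag{D.2}$$

By construction of $U$ and $\tilde{C}^{n,n}$, $(x^n, z^n) \in U$ implies that at least $\delta n$ components of $\hat{x}^n = C^{n,n}(x^n, z^n)$ satisfy 
$\rho(x_i, \hat{x}_i) > M$, so by definition of $\rho_M$ it follows that $\rho_M(x^n, C^{n,n}(x^n, z^n)) \ge n\delta M$ for all $(x^n, z^n) \in U$. Thus, 

$$\frac{1}{n} \int_{U} \rho_M(x^n, C^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n) \ge \delta M \cdot P_\theta^n \times P_\theta^n(U),$$

which, together with (D.2), implies that 

$$P_\theta^n \times P_\theta^n(U) \le \frac{D_\theta(C^{n,n})}{\delta M}. \tag{D.3}$$
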


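Using the Cauchy-Schwarz inequality, (4.18), and (D.3), we can write 

$$\frac{1}{n} \int_{U} \rho(x^n, \tilde{C}^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n) = \frac{1}{n} \mathbb{E}_\theta\left[\rho(X^n, a_*^n) \cdot 1_U\right]$$

$$\le \sqrt{P_\theta^n \times P_\theta^n(U) \, \mathbb{E}_\theta\left[\rho^2(X^n, a_*^n)/n^2\right]}$$

$$\le \sqrt{\frac{2G D_\theta(C^{n,n})}{\delta M}}, \tag{D.4}$$

where the last inequality follows from the easily established fact that, for any n independent random variables 
$V_1, \ldots, V_n$ satisfying $\mathbb{E}[V_i^2] \le G$, $\mathbb{E}\left[\left(\frac{1}{n} \sum_{i=1}^{n} V_i\right)^2\right] \le 2G$. 
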


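Now, $(x^n, z^n) \in U^c$ implies that for each $i = 1, \ldots, n$ either $\tilde{x}_i = \hat{x}_i$ and $\rho_M(x_i, \hat{x}_i) = \rho(x_i, \hat{x}_i)$, or $\tilde{x}_i = a_*$ and 
$\rho(x_i, \hat{x}_i) > M$, where $\hat{x}^n = (\hat{x}_1, \ldots, \hat{x}_n) = C^{n,n}(x^n, z^n)$ and $\tilde{x}^n = (\tilde{x}_1, \ldots, \tilde{x}_n) = \tilde{C}^{n,n}(x^n, z^n)$. Then, by the 
union bound, 

$$\frac{1}{n} \int_{U^c} \rho(x^n, \tilde{C}^{n,n}(x^n, z^n)) \, dP_\theta(x^n, z^n) = \frac{1}{n} \sum_{i=1}^{n} \int_{U^c} \rho(x_i, \tilde{x}_i) \, dP_\theta(x^n, z^n)$$

$$\le \frac{1}{n} \sum_{i=1}^{n} \int_{U^c} \rho_M(x_i, \hat{x}_i) \, dP_\theta(x^n, z^n) + \frac{1}{n} \sum_{i=1}^{n} \int_{\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}} \rho(x_i, a_*) \, dP_\theta(x^n, z^n). \tag{D.5}$$

The first term on the right-hand side of (D.5) is bounded as 

$$\frac{1}{n} \sum_{i=1}^{n} \int_{U^c} \rho_M(x_i, \hat{x}_i) \, dP_\theta(x^n, z^n) \le D_\theta(C^{n,n}). \tag{D.6}$$
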
As for the second term, we can once again invoke the Cauchy-Schwarz inequality and (4.18) to write 

$$\frac{1}{n} \sum_{i=1}^{n} \int_{\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}} \rho(x_i, a_*) \, dP_\theta(x^n, z^n) \le \frac{1}{n} \sum_{i=1}^{n} \sqrt{P_\theta(\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}) \, \mathbb{E}_\theta[\rho^2(X, a_*)]}$$

$$\le \frac{\sqrt{G}}{n} \sum_{i=1}^{n} \sqrt{P_\theta(\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\})}. \tag{D.7}$$


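Let us estimate the summation on the right-hand side of (D.7). First of all, note that 

$$\frac{1}{n} \sum_{i=1}^{n} \int_{\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}} \rho_M(x_i, \hat{x}_i) \, dP_\theta(x^n, z^n) \le D_\theta(C^{n,n}). \tag{D.8}$$

Now, $\rho(x_i, \hat{x}_i) > M$ implies that $\rho_M(x_i, \hat{x}_i) = M$, so that 

$$\int_{\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}} \rho_M(x_i, \hat{x}_i) \, dP_\theta(x^n, z^n) = M P_\theta(\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}),$$

which, together with (D.8), yields the estimate 

$$\sum_{i=1}^{n} P_\theta(\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}) \le \frac{n D_\theta(C^{n,n})}{M},$$

whence by the concavity of the square root it follows that 

$$\sum_{i=1}^{n} \sqrt{P_\theta(\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\})} \le n \sqrt{\frac{D_\theta(C^{n,n})}{M}}.$$

Substituting this bound into (D.7) yields 

$$\frac{1}{n} \sum_{i=1}^{n} \int_{\{(x^n, z^n) : \rho(x_i, \hat{x}_i) > M\}} \rho(x_i, a_*) \, dP_\theta(x^n, z^n) \le \sqrt{\frac{G D_\theta(C^{n,n})}{M}}. \tag{D.9}$$

The lemma is proved by combining (D.1), (D.4), (D.6), and (D.9). 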